Given a finite set $K$, we denote by $X = \Delta(K)$ the set of probabilities on $K$ and by $Z = \Delta_f(X)$ the set of Borel probabilities on $X$ with finite support. Studying a Markov Decision Process with partial information on $K$ naturally leads to a Markov Decision Process with full information on $X$. We introduce a new metric $d_*$ on $Z$ such that the transitions become 1-Lipschitz from $(X, \|\cdot\|_1)$ to $(Z, d_*)$. In the first part of the article, we define and prove several properties of the metric $d_*$. In particular, $d_*$ satisfies a Kantorovich-Rubinstein type duality formula and can be characterized by using disintegrations. In the second part, we characterize the limit values in several classes of "compact non expansive" Markov Decision Processes. In particular we use the metric $d_*$ to characterize the limit value in Partial Observation MDP with finitely many states and in Repeated Games with an informed controller with finite sets of states and actions. Moreover in each case we can prove the existence of a generalized notion of uniform value where we consider not only the Cesaro mean when the number of stages is large enough but any evaluation function $\theta \in \Delta(\mathbb{N}^*)$ when the impatience $I(\theta) = \sum_{t \geq 1} |\theta_{t+1} - \theta_t|$ is small enough.
1 Introduction

The classic model of Markov Decision Processes (MDP) with finitely many states, a
particular class of the model of Stochastic Games introduced by Shapley (1953),
was explicitly introduced by Bellman (1957) and has been extensively
studied since then. When the set of actions is also finite, Blackwell (1962) proved
the existence of a strategy which is optimal for all discount factors close to 0. This
model was later generalized to MDPs with Partial Observation (POMDP) (for
references see Arapostathis et al. (1993)). The decision maker observes neither
the state nor his payoff. Instead at each stage, he receives a signal which depends 
on the previous state and his previous action. In order to solve this problem a 
classic approach is to go back to the classic model of MDPs by introducing an 
auxiliary problem with full observation and Borel state space: the space of beliefs
on the state, as shown in Åström (1965), Sawaragi and Yoshikawa (1970)
and Rhenius (1974). For optimality criteria like the Cesaro mean and the Abel
mean, these two problems are equivalent and the question of the existence of the
limit value is the same. Then given some sufficient conditions of ergodicity, one
can search for a solution of the Average Cost Optimality Equation (ACOE) in order to find
"the" value of the MDP, for example as in Runggaldier and Stettner (1991) or as
in Borkar (2000, 2007). An introduction to the ACOE in the framework of MDP
and the reduction of POMDP can be found in Hernández-Lerma (1989). From
another point of view, if we know that the limit value exists, the ACOE may be 
used as a characterization of the value. For finite MDP, for example, Denardo and 
Fox (1968) proved that the limit value is the solution of a linear programming 
problem deduced from the ACOE. Moreover by standard linear programming 
results, it is also equal to the solution of a dual problem from which Hordijk and
Kallenberg (1979) deduced an optimal strategy. This dual problem focuses on
the maximal payoff that the decision maker can guarantee on invariant measures.
This approach was extended to different criteria (see Kallenberg 1994) and to a
convex analytic approach by Borkar (for references see Borkar 2002) in order to
study problems with a countable state space and a compact action space. 

Given an initial POMDP on a finite space $K$, we will follow the usual approach
and introduce an MDP on $X = \Delta(K)$, but instead of assuming some ergodicity on
the process we will use the structure of $\Delta(K)$ and a new metric on $Z = \Delta_f(\Delta(K))$.
We extend and relax the MDP on $\Delta_f(X)$ with a uniformly continuous affine payoff
function and non-expansive affine transitions. The structure of $Z$ was already
used in Rosenberg et al. (2002) and in Renault (2011). Under our new metric,
we highlight a stronger property since the transitions become 1-Lipschitz on $Z$
and $Z$ is still precompact. We use this property to focus on general evaluations.
Given a probability distribution $\theta$ on positive integers, we evaluate a sequence
of payoffs $g = (g_t)_{t \geq 1}$ by $\gamma_\theta(g) = \sum_{t \geq 1} \theta_t g_t$. In an MDP or a POMDP, the $\theta$-value
is then defined as the maximum expected payoff that the player can guarantee
with this evaluation. Most of the literature focuses on the $n$-stage game, where we
consider the Cesaro mean of length $n$, and on the $\lambda$-discounted game, where we
consider the Abel mean with parameter $\lambda$. The first type of results focuses on the
limit when $n$ converges to $+\infty$ and when $\lambda$ converges to 0, or on the relation between
them. When there is no player, the relation between them is directly linked to a
Hardy-Littlewood theorem (see Filar and Sznajder, 1992): one of the limits exists
if and only if the other exists, and whenever they exist they are equal. Lehrer and
Sorin (1992) proved that this result extends to the case where there is one player 
provided we ask for uniform convergence. The other approach focuses on the 
existence of a good strategy in any long game or for any discount factor close to 
0. We say that the MDP has a uniform value. For MDP with finitely many states, 
Blackwell's result (1962) solved both problems. In POMDPs, Rosenberg, et al. 
(2002) proved the existence of the uniform value when the sets of states, actions 
and signals are finite, and Renault (2011) removed the finiteness assumption on 
signals and actions. 
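
As an illustration of the evaluations discussed above, the following minimal sketch computes $\gamma_\theta(g) = \sum_t \theta_t g_t$ and the impatience $I(\theta) = \sum_{t \geq 1} |\theta_{t+1} - \theta_t|$ for finitely supported weights. The function names and the truncation of the Abel weights at a finite horizon are assumptions made for this example only; they are not part of the paper.

```python
from typing import List


def evaluate(theta: List[float], payoffs: List[float]) -> float:
    """gamma_theta(g) = sum_t theta_t * g_t, for a finitely supported theta."""
    return sum(w * g for w, g in zip(theta, payoffs))


def impatience(theta: List[float]) -> float:
    """I(theta) = sum_{t>=1} |theta_{t+1} - theta_t|, with theta_t = 0 beyond the list."""
    padded = list(theta) + [0.0]
    return sum(abs(padded[t + 1] - padded[t]) for t in range(len(theta)))


def cesaro(n: int) -> List[float]:
    """Weights of the n-stage Cesaro mean: theta_t = 1/n for t <= n."""
    return [1.0 / n] * n


def discounted(lam: float, horizon: int) -> List[float]:
    """Abel weights lam * (1 - lam)**(t - 1), cut at `horizon` (an assumption:
    the true evaluation has infinite support, so these weights sum to less than 1)."""
    return [lam * (1.0 - lam) ** t for t in range(horizon)]


payoffs = [1.0, 0.0, 1.0, 0.0]
cesaro_value = evaluate(cesaro(4), payoffs)              # plain average of the payoffs
cesaro_impatience = impatience(cesaro(10))               # one jump of size 1/n
discounted_impatience = impatience(discounted(0.2, 50))  # decreasing weights telescope to lam
```

Under these conventions the $n$-stage evaluation has impatience $1/n$ and the (truncated) $\lambda$-discounted evaluation has impatience $\lambda$, so both classical limits correspond to vanishing impatience.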

Concerning stochastic games, Mertens and Neyman (1981) proved the exis- 
tence of the uniform value when the set of states and the set of actions are finite. 
The model also generalizes to partial information but the existence of possible pri- 
vate information implies a more complex structure on the auxiliary state space. 
Mertens and Zamir (1985) and Mertens (1987) introduced the universal belief 
space which synthesizes all the information for both players in a general repeated 
game : their beliefs about the state, their beliefs about the beliefs of the other 
player, etc... So far, the results always concern some subclasses of games where we 
can explicitly write the auxiliary game in a "small" tractable set. A lot of work 
has been done on games with one fully informed player and one player with par- 
tial information, introduced by Aumann and Maschler (see reference from 1995). 
A state is chosen at stage 0 and remains fixed for the rest of the game. Renault
(2006) extended the analysis to a general underlying Markov chain on the state
space (see also Neyman, 2008). Rosenberg et al. (2004) and Renault (2012a) proved
the existence of the uniform value when the informed player can additionally
control the evolution of the state variable. 

The first section is dedicated to the description of the (pseudo-)distance on
$\Delta(X)$ in the general framework where $X$ is a compact subset of a normed vector
space. We provide different definitions and show that they all define this pseudo-distance.
Then we focus on the case where $X$ is a simplex. We prove that $d_*$ is a
genuine metric and prove a "Kantorovich-Rubinstein like" duality formula for probabilities
with finite support on $X$. We give new definitions and a characterization
by the disintegration mapping. The second section focuses on Gambling Houses 
and standard Markov Decision Processes. We first introduce the definitions of 
general limit value and general uniform value. Then we give sufficient conditions 
for the existence of the general uniform value and a characterization in several 
"compact" cases of Gambling Houses and Markov Decision Processes, including 
the finite state case. We study the limit value as a linear function of the initial
probability so there are similarities with the convex analytic approach, but we 
are able to avoid any assumption on the set of actions. Moreover the MDPs that 
we are considering may not have 0-optimal strategies as shown in Renault (2011). 
Finally we apply these results to prove the existence of the general uniform value 
in finite state POMDPs and repeated games with an informed controller. 
2 A distance for belief spaces

2.1 A pseudo-distance for probabilities on a compact subset of a normed vector space

We fix a compact subset $X$ of a real normed vector space $V$. We denote by
$E = C(X)$ the set of continuous functions from $X$ to the reals, and by $E_1$ the set
of 1-Lipschitz functions in $E$. We denote by $\Delta(X)$ the set of Borel probability
measures on $X$, and for each $x$ in $X$ we write $\delta_x$ for the Dirac probability measure
at $x$. It is well known that $\Delta(X)$ is a compact set for the weak-* topology, and
this topology can be metrized by the (Wasserstein) Kantorovich-Rubinstein
distance:

$$\forall u, v \in \Delta(X), \quad d_{KR}(u,v) = \sup_{f \in E_1} u(f) - v(f).$$

We will introduce a pseudo-distance on $\Delta(X)$, which is not greater than $d_{KR}$
and in some cases also metrizes the weak-* topology. We start with several definitions,
which will turn out to be equivalent. Let $u$ and $v$ be in $\Delta(X)$.

Definition 2.1.

$$d_1(u,v) = \sup_{f \in D_1} u(f) - v(f),$$

where $D_1 = \{f \in E, \ \forall x, y \in X, \ \forall a, b \geq 0, \ a f(x) - b f(y) \leq \|ax - by\|\}$.

Note that any linear functional on $V$ with norm 1 induces an element of $D_1$.
$d_1$ is a pseudo-distance on $\Delta(X)$, and $d_1(u,v) = \sup_{f \in D_1} |u(f) - v(f)|$, since if
$f$ is in $D_1$, $-f$ is also in $D_1$. We also have $D_1 \subset E_1$, so that $d_1(u,v) \leq d_{KR}(u,v)$
and the supremum in the definition of $d_1(u,v)$ is achieved.

Given $x$ and $y$ in $X$, there exists a linear functional $f$ on $V$ with norm 1
such that $f(y - x) = \|y - x\|$. Then the restriction of $f$ to $X$ is in $D_1$ and
$d_1(\delta_x, \delta_y) \geq \|x - y\|$. One can easily deduce that $d_1(\delta_x, \delta_y) = \|x - y\|$ for $x$ and $y$
in $X$.

Example 2.2. Consider the particular case where $X = [0, 1]$ endowed with the
usual norm. Then all $f$ in $D_1$ are linear. As a consequence, $d_1(u,v) = 0$ for
$u = \frac{1}{2}\delta_0 + \frac{1}{2}\delta_1$ and $v = \delta_{1/2}$. We do not have the separation property and $d_1$
is not a distance in this case.

Let us modify the example. $X$ now is the set of probability distributions
over 2 elements, viewed as $X = \{(x, 1 - x), \ x \in [0, 1]\}$. We use the norm $\|\cdot\|_1$
to measure the distance between $(x, 1 - x)$ and $(y, 1 - y)$, so that $V = \mathbb{R}^2$ is
endowed with $\|(x_1, x_2) - (y_1, y_2)\| = |x_1 - y_1| + |x_2 - y_2|$. Consider $f$ in $E$ such
that $f((x, 1 - x)) = x(1 - x)$ for all $x$. $f$ now belongs to $D_1$, and $d_1(u,v) \geq 1/4 > 0$
for $u = \frac{1}{2}\delta_0 + \frac{1}{2}\delta_1$ and $v = \delta_{1/2}$ (identifying $x$ with $(x, 1-x)$). One can show that $(\Delta(X), d_1)$ is a compact
metric space in this case (see proposition 2.15 later), and for applications in this
paper $d_1$ will be a particularly useful distance whenever $X$ is a simplex $\Delta(K)$
endowed with $\|x - y\| = \sum_{k \in K} |x^k - y^k|$.
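
The second part of example 2.2 can be checked numerically. The sketch below tests the defining inequality of $D_1$ for $f((x,1-x)) = x(1-x)$ on a finite grid of values of $a, b, x, y$ and evaluates the lower bound $|u(f) - v(f)|$. A grid check is only a sanity test, not a proof, and the grid size and helper names are choices made for this illustration.

```python
def f(x: float) -> float:
    """Test function on X = {(x, 1 - x)}: f((x, 1 - x)) = x * (1 - x)."""
    return x * (1.0 - x)


def d1_slack(a: float, x: float, b: float, y: float) -> float:
    """||a*(x, 1-x) - b*(y, 1-y)||_1 - (a*f(x) - b*f(y)).
    The D_1 constraint holds at (a, x, b, y) exactly when this is >= 0."""
    norm = abs(a * x - b * y) + abs(a * (1 - x) - b * (1 - y))
    return norm - (a * f(x) - b * f(y))


# By positive homogeneity it is enough to test a, b in [0, 1].
grid = [i / 20 for i in range(21)]
worst_slack = min(
    d1_slack(a, x, b, y) for a in grid for b in grid for x in grid for y in grid
)

# u = (1/2) delta_0 + (1/2) delta_1 and v = delta_{1/2}
u_f = 0.5 * f(0.0) + 0.5 * f(1.0)
v_f = f(0.5)
lower_bound = abs(u_f - v_f)
```

On this grid the slack never becomes negative, which is consistent with $f \in D_1$, and the lower bound equals $1/4$.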
Furthermore it is known that the Kantorovich-Rubinstein metric on $\Delta(X)$
only depends on the restriction of the norm $\|\cdot\|$ to the set $X$. In particular, if
$\|x - x'\| = 2$ for all $x, x' \in X$ such that $x \neq x'$, then for all $u, v \in \Delta(X)$, $d_{KR}(u,v) =
\|u - v\|_1$. This is not the case when considering the metric $d_1$: two norms on $V$
giving the same metric on $X$ may lead to different pseudo-metrics on $\Delta(X)$. We
consider in the next example different norms on the Euclidean space $\mathbb{R}^K$.

Example 2.3. We consider $V = \mathbb{R}^K$, $X = \{e_1, \dots, e_K\}$ the set of canonical vectors
of $V$ and a norm such that for all $k \neq k'$, $\|e_k - e_{k'}\| = 2$. We know that
$d_1$ is smaller than the Kantorovich-Rubinstein metric, so for all $u \in \Delta(X)$ and
$v \in \Delta(X)$, we have $d_1(u, v) \leq \|u - v\|_1$.

We first consider the particular case of the norm defined by $\|x - y\| = 2^{\frac{p-1}{p}} \|x - y\|_p$, where
$\|x - y\|_p = \left(\sum_{k=1}^K |x_k - y_k|^p\right)^{1/p}$ is the usual $L^p$-norm on $\mathbb{R}^K$, with $p$
a fixed positive integer. Given $u, v \in \Delta(X)$, the function $f$ defined by

$$\forall k \in K, \quad f(k) = \begin{cases} 1 & \text{if } u(k) \geq v(k), \\ -1 & \text{otherwise,} \end{cases}$$

satisfies $u(f) - v(f) = \sum_{k \in K} |u(k) - v(k)| = \|u - v\|_1$. Moreover for all $a \geq 0$,
$b \geq 0$ and $k, k' \in K$ such that $k \neq k'$, we have

$$a f(k) - b f(k') \leq a + b \leq 2^{\frac{p-1}{p}} (a^p + b^p)^{1/p} = \|a e_k - b e_{k'}\|$$

and $a f(k) - b f(k) \leq |a - b| \leq 2^{\frac{p-1}{p}} |a - b| = \|a e_k - b e_k\|$. Therefore $f$ is in $D_1$ and $d_1(u,v) =
\|u - v\|_1$, independently of $p$. Similarly, the same result holds for the case $p = +\infty$, i.e. where $\|x - y\| = 2\|x - y\|_\infty$.

Nevertheless the inequality $d_1(u, v) \leq \|u - v\|_1$ may be strict, as in the following
example. We consider the case $K = 3$ and, given a vector $(x_1, x_2, x_3) \in \mathbb{R}^3$, we
define the norm $\|(x_1, x_2, x_3)\| = \max(|x_1| + |x_2|, 2|x_3|)$, which satisfies $\|e_1 - e_2\| =
\|e_2 - e_3\| = \|e_3 - e_1\| = 2$. Let $f$ be a function in $D_1$; then we have among others
the following constraints:

$$\forall a, b \geq 0, \quad a f(e_3) - b f(e_1) \leq \|(-b, 0, a)\| = \max(2a, b),$$
$$\text{and } \forall a \geq 0, \quad a f(e_2) \leq \|(0, a, 0)\| = a.$$

Let $u = (0, 1/2, 1/2)$, $v = (1, 0, 0)$ and $f \in D_1$, then

$$u(f) - v(f) = \frac{1}{2} f(e_2) + \frac{1}{2} f(e_3) - f(e_1) \leq \frac{1}{2} + \max(2/2, 1) = \frac{3}{2}.$$

By symmetry of $D_1$, we deduce that $d_1(u, v) \leq \frac{3}{2} < 2 = \|u - v\|_1$. In fact one can show
that $d_1(u,v) = \frac{3}{2}$ by checking that the function defined by $f(e_1) = 0$, $f(e_2) = 1$
and $f(e_3) = 2$ is in $D_1$ and satisfies $u(f) - v(f) = \frac{3}{2}$.
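
The last claim of example 2.3 lends itself to a quick numerical sanity check. The sketch below tests the $D_1$ constraints for a function on $X = \{e_1, e_2, e_3\}$ on a finite grid of $(a, b)$ under the norm $\max(|x_1| + |x_2|, 2|x_3|)$. The helper `in_D1_on_grid` and the grid resolution are assumptions of this illustration, and passing the grid test does not replace the analytic verification.

```python
from typing import List, Sequence


def norm(x: Sequence[float]) -> float:
    """The norm ||(x1, x2, x3)|| = max(|x1| + |x2|, 2|x3|) of example 2.3."""
    return max(abs(x[0]) + abs(x[1]), 2 * abs(x[2]))


BASIS = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]


def in_D1_on_grid(f: List[float], steps: int = 20, tol: float = 1e-12) -> bool:
    """Check a*f(e_i) - b*f(e_j) <= ||a*e_i - b*e_j|| for a, b on a grid of [0, 1].
    By positive homogeneity, restricting a and b to [0, 1] loses nothing."""
    grid = [i / steps for i in range(steps + 1)]
    for i, ei in enumerate(BASIS):
        for j, ej in enumerate(BASIS):
            for a in grid:
                for b in grid:
                    diff = [a * s - b * t for s, t in zip(ei, ej)]
                    if a * f[i] - b * f[j] > norm(diff) + tol:
                        return False
    return True


u = [0.0, 0.5, 0.5]
v = [1.0, 0.0, 0.0]

f = [0.0, 1.0, 2.0]              # candidate from the text
value = sum((uk - vk) * fk for uk, vk, fk in zip(u, v, f))
l1_distance = sum(abs(uk - vk) for uk, vk in zip(u, v))

sign_function = [-1.0, 1.0, 1.0]  # would give ||u - v||_1 but violates a constraint
```

Here `value` is $3/2$ while `l1_distance` is $2$, and the sign function that attains $\|u - v\|_1$ fails the grid test (for instance at $a = 1/2$, $b = 1$ with the pair $(e_3, e_1)$).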
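We now give other expressions for the pseudo-distance $d_1$.

Definition 2.4.

$$d_2(u,v) = \sup_{(f,g) \in D_2} u(f) + v(g),$$

where $D_2 = \{(f,g) \in E \times E, \ \forall x, y \in X, \ \forall a, b \geq 0, \ a f(x) + b g(y) \leq \|ax - by\|\}$.

Definition 2.5.

$$d_2'(u,v) = \inf_{\varepsilon > 0} d_2^\varepsilon(u,v), \quad \text{where} \quad d_2^\varepsilon(u,v) = \sup_{(f,g) \in D_2^\varepsilon} u(f) + v(g)$$

and, for all $\varepsilon > 0$, $D_2^\varepsilon = \{(f,g) \in E \times E, \ \forall x, y \in X, \ \forall a, b \in [0,1], \ a f(x) + b g(y) \leq \varepsilon + \|ax - by\|\}$.

Definition 2.6.

$$d_3(u,v) = \inf_{\gamma \in \mathcal{M}_3(u,v)} \int_{X^2 \times [0,1]^2} \|\lambda x - \mu y\| \, d\gamma(x, y, \lambda, \mu),$$

where $\mathcal{M}_3(u,v)$ is the set of finite positive measures on $X^2 \times [0,1]^2$ satisfying for
each $f$ in $E$:

$$\int_{(x,y,\lambda,\mu) \in X^2 \times [0,1]^2} \lambda f(x) \, d\gamma(x,y,\lambda,\mu) = u(f), \quad \text{and} \quad \int_{(x,y,\lambda,\mu) \in X^2 \times [0,1]^2} \mu f(y) \, d\gamma(x,y,\lambda,\mu) = v(f).$$

In the next subsection we will prove the following result.

Theorem 2.7. For all $u$ and $v$ in $\Delta(X)$, $d_1(u,v) = d_2(u,v) = d_2'(u,v) = d_3(u,v)$.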



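2.2 Proof of theorem 2.7

The proof is split into several parts.

Proposition 2.8. $d_1 = d_2 = d_2'$.

It is plain that $d_1 \leq d_2 \leq d_2'$, so all we have to prove is $d_2' \leq d_1$. We start
with a lemma.

Lemma 2.9. Fix $\varepsilon > 0$, and let $f$ in $E$ be such that: $\forall x \in X$, $\forall a \in [0,1]$,
$a f(x) \leq \varepsilon + a\|x\|$. Define $\hat f$ by:

$$\forall y \in X, \quad \hat f(y) = \inf_{a \in [0,1], \, b \in (0,1], \, x \in X} \frac{1}{b} \left(\varepsilon + \|ax - by\| - a f(x)\right).$$

Then for each $y$ in $X$, $-\|y\| \leq \hat f(y) \leq -f(y) + \varepsilon$. Moreover $\hat f \in E_1$, and:
$\forall x \in X$, $\forall y \in X$, $\forall a \in [0,1]$, $\forall b \in [0,1]$, $a \hat f(x) - b \hat f(y) \leq a\varepsilon + \|by - ax\|$.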
Proof of lemma 2.9: By assumption on $f$, we have for all $y$ in $X$, $a$ in $[0,1]$, $b$
in $(0,1]$, $x$ in $X$: $\frac{1}{b}(\varepsilon + \|ax - by\| - a f(x)) \geq \frac{1}{b}(-a\|x\| + \|ax - by\|) \geq -\|y\|$. In
the definition of $\hat f(y)$, considering $a = b = 1$ and $x = y$ yields $\hat f(y) \leq -f(y) + \varepsilon$.

Fix $x$ and $y$ in $X$, $a$ and $b$ in $[0,1]$. We have:

$$a \hat f(x) - b \hat f(y) = a \inf_{a', b', x'} \frac{1}{b'} \left(\varepsilon + \|a'x' - b'x\| - a' f(x')\right) - b \inf_{a'', b'', x''} \frac{1}{b''} \left(\varepsilon + \|a''x'' - b''y\| - a'' f(x'')\right).$$

If $a = 0$, then the inequality $\hat f(y) \geq -\|y\|$ leads to $-b \hat f(y) \leq b\|y\|$. If $b = 0$,
choose $a' = 0$, $b' = 1$ and $x' = x$ to get $a \hat f(x) \leq a\varepsilon + \|ax\|$.

If $ab > 0$, given $\eta > 0$, choose $a''$, $b''$, $x''$ $\eta$-optimal in the second infimum. We
can define $x' = x''$, and choose $a' \in [0,1]$ and $b' \in (0,1]$ such that $\frac{a'}{b'} = \frac{a'' b}{a b''}$. We
obtain:

$$a \hat f(x) - b \hat f(y) \leq b\eta + \left(\frac{a}{b'} - \frac{b}{b''}\right)\varepsilon + \left( \left\| \frac{a''}{b''} b x'' - a x \right\| - \left\| \frac{a''}{b''} b x'' - b y \right\| \right) \leq b\eta + \left(\frac{a}{b'} - \frac{b}{b''}\right)\varepsilon + \|ax - by\|.$$

If $a = b > 0$, choose $a' = a''$ and $b' = b''$ to obtain: $\hat f(x) - \hat f(y) \leq \eta + \|x - y\|$ for every $\eta > 0$, and
therefore $\hat f$ is 1-Lipschitz.

Otherwise, we distinguish two cases. If $\frac{a}{b} b'' \leq 1$, we define $b' = \frac{a}{b} b''$ and $a' = a''$
and we get $a \hat f(x) - b \hat f(y) \leq b\eta + \|ax - by\|$. If $\frac{a}{b} b'' > 1$, we define $b' = 1$ and
$a' = \frac{a'' b}{a b''} \in [0,1]$ and obtain $a \hat f(x) - b \hat f(y) \leq b\eta + a\varepsilon + \|ax - by\|$. Thus for all
$\eta > 0$, we have

$$a \hat f(x) - b \hat f(y) \leq b\eta + a\varepsilon + \|ax - by\|,$$

and therefore $a \hat f(x) - b \hat f(y) \leq a\varepsilon + \|ax - by\|$. □



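Proof of proposition 2.8: Fix $u$ and $v$ in $\Delta(X)$, and consider $\varepsilon > 0$. For each
$(f, g)$ in $D_2^\varepsilon$, we have $-f + \varepsilon \geq \hat f \geq g$ and $(f, \hat f)$ in $D_2^\varepsilon$. We also have $(\hat f, f) \in D_2^\varepsilon$,
so iterating the construction, we get $(\hat f, \hat{\hat f}) \in D_2^\varepsilon$, and $-\hat f + \varepsilon \geq \hat{\hat f} \geq f$.

Now, $u(f) + v(g) \leq u(\hat{\hat f}) + v(\hat f) \leq -u(\hat f) + \varepsilon + v(\hat f)$. Hence we have obtained:

$$d_2^\varepsilon(u,v) \leq \varepsilon + \sup_{f \in C_\varepsilon(u,v)} -u(f) + v(f),$$

where $C_\varepsilon(u,v)$ is the set of functions $f$ in $E_1$ satisfying:

$$\forall x \in X, \ \forall y \in X, \ \forall a \in [0,1], \ \forall b \in [0,1], \quad a f(x) - b f(y) \leq a\varepsilon + \|ax - by\| \quad \text{and} \quad f(y) \geq -\|y\|.$$

For each positive $k$, one can choose $f_k$ in $E_1$ achieving the above supremum for
$\varepsilon = 1/k$. Taking a limit point of $(f_k)_k$ yields a function $f$ in $D_1$ such that:
$-u(f) + v(f) \geq d_2'(u,v)$. The function $f^* = -f$ is in $D_1$ and satisfies $u(f^*) -
v(f^*) \geq d_2'(u,v)$, and the proof of proposition 2.8 is complete. □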



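Proposition 2.10. $d_2' \geq d_3$.

Proof: The proof is based on (a corollary of) Hahn-Banach theorem. Define:

$H = C(X^2 \times [0,1]^2)$ and

$L = \{\varphi \in H, \ \exists f, g \in C(X) \text{ s.t. } \forall x, y \in X, \ \forall \lambda, \mu \in [0,1], \ \varphi(x, y, \lambda, \mu) = \lambda f(x) + \mu g(y)\}$.

$H$ is endowed with the uniform norm and $L$ is a linear subspace of $H$. Note that
the unique constant mapping in $L$ is 0. Fix $u$ and $v$ in $\Delta(X)$, and let $r$ be the linear
form on $L$ defined by $r(\varphi) = u(f) + v(g)$, where $\varphi(x, y, \lambda, \mu) = \lambda f(x) + \mu g(y)$ for
all $x, y, \lambda, \mu$.

Fix now $\varepsilon > 0$, and put:

$$U_\varepsilon = \{\varphi \in H, \ \forall x, y \in X, \ \forall \lambda, \mu \in [0,1], \ \varphi(x, y, \lambda, \mu) \leq \|\lambda x - \mu y\| + \varepsilon\}.$$

We have:

$$\sup_{\varphi \in L \cap U_\varepsilon} r(\varphi) = d_2^\varepsilon(u, v).$$

$U_\varepsilon$ is a convex subset of $H$ which is radial at 0, in the sense that: $\forall \varphi \in H$, $\exists \delta > 0$
such that $t\varphi \in U_\varepsilon$ as soon as $|t| \leq \delta$. By a corollary of Hahn-Banach theorem (see
theorem 6.2.11 p. 202 in Dudley, 2002), $r$ can be extended to a linear form on $H$
such that:

$$\sup_{\varphi \in U_\varepsilon} r(\varphi) = d_2^\varepsilon(u, v).$$

Given $\varphi \in H$, we have $\varepsilon\varphi/\|\varphi\|_\infty \in U_\varepsilon$, which implies that $r(\varphi) \leq \|\varphi\|_\infty d_2^\varepsilon(u,v)/\varepsilon$,
so that $r$ belongs to $H'$. And if $\varphi \geq 0$, we have $t\varphi \in U_\varepsilon$ if $t \leq 0$, so that
$r(\varphi) \geq d_2^\varepsilon(u,v)/t$ for all $t < 0$ and $r(\varphi) \geq 0$. By Riesz Theorem, $r$ can be
represented by a positive finite measure $\gamma$ on $X^2 \times [0,1]^2$.

Given $f$ in $E$, one can consider $\varphi_f \in L$ defined by $\varphi_f(x, y, \lambda, \mu) = \lambda f(x)$. $r(\varphi_f) =
\gamma(\varphi_f)$ gives: $u(f) = \int_{(x,y,\lambda,\mu) \in X^2 \times [0,1]^2} \lambda f(x) \, d\gamma(x, y, \lambda, \mu)$, and similarly
$v(f) = \int_{(x,y,\lambda,\mu) \in X^2 \times [0,1]^2} \mu f(y) \, d\gamma(x, y, \lambda, \mu)$, and we obtain that $\gamma \in \mathcal{M}_3(u,v)$.

Because $\gamma \geq 0$, $\sup_{\varphi \in U_\varepsilon} r(\varphi) = r(\varphi^*)$ where $\varphi^*(x, y, \lambda, \mu) = \|\lambda x - \mu y\| + \varepsilon$.
We get $d_2^\varepsilon(u, v) = \int_{X^2 \times [0,1]^2} \|\lambda x - \mu y\| \, d\gamma(x, y, \lambda, \mu) + \varepsilon \gamma(X^2 \times [0,1]^2)$, so

$$d_2^\varepsilon(u,v) \geq \int_{X^2 \times [0,1]^2} \|\lambda x - \mu y\| \, d\gamma(x, y, \lambda, \mu) \geq d_3(u, v).$$

□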

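Lemma 2.11. $d_3 \geq d_2$.

Proof: Fix $(f, g) \in D_2$ and $\gamma \in \mathcal{M}_3(u, v)$.

$$u(f) + v(g) = \int_{X^2 \times [0,1]^2} \lambda f(x) \, d\gamma(x, y, \lambda, \mu) + \int_{X^2 \times [0,1]^2} \mu g(y) \, d\gamma(x, y, \lambda, \mu)$$
$$= \int_{X^2 \times [0,1]^2} \left(\lambda f(x) + \mu g(y)\right) d\gamma(x, y, \lambda, \mu)$$
$$\leq \int_{X^2 \times [0,1]^2} \|\lambda x - \mu y\| \, d\gamma(x, y, \lambda, \mu). \quad □$$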
2.3 The case of probabilities over a simplex

We assume here that $X = \Delta(K)$, where $K$ is a non empty finite set. We use
$\|p\| = \sum_k |p^k|$ for every vector $p = (p^k)_{k \in K}$ in $\mathbb{R}^K$, and view $X$ as the set of
vectors in $\mathbb{R}_+^K$ with norm 1.

Recall that for $u$ and $v$ in $\Delta(X)$, we have $d_1(u,v) = \sup_{f \in D_1} |u(f) - v(f)|$, where
$D_1 = \{f \in E, \ \forall x, y \in X, \ \forall a, b \geq 0, \ a f(x) - b f(y) \leq \|ax - by\|\}$.

We now introduce an alternative definition of $d_1$ using "non revealing game
functions". These functions come from the theory of repeated games with incomplete
information à la Aumann Maschler (1995), and the interest for the distance
$d_0$ emerged several years ago while doing research on Markov decision processes
with partial observation and repeated games with an informed controller (see
Renault 2011 and 2012a).

Given a collection of matrices $(G^k)_{k \in K}$ (all of the same finite size $I \times J$)
indexed by $K$ and with values in $[-1, 1]$, we define the "non revealing function"
$f$ in $C(X)$ by:

$$\forall p \in X, \quad f(p) = \mathrm{Val}\left(\sum_{k \in K} p^k G^k\right)$$
$$= \max_{x \in \Delta(I)} \min_{y \in \Delta(J)} \sum_{i \in I, j \in J} x(i) y(j) \left(\sum_{k \in K} p^k G^k(i, j)\right)$$
$$= \min_{y \in \Delta(J)} \max_{x \in \Delta(I)} \sum_{i \in I, j \in J} x(i) y(j) \left(\sum_{k \in K} p^k G^k(i, j)\right).$$

$f(p)$ is the minmax value of the average matrix $\sum_k p^k G^k$. The set of all such non
revealing functions $f$, where $I$, $J$ and $(G^k)_{k \in K}$ vary, is denoted by $D_0$.
Clearly, all affine functions from $X$ to $[-1,1]$ belong to $D_0$. It is known that the
set of non revealing functions is dense in $C(X)$. However, we only consider here non
revealing functions defined by matrices with values in $[-1, 1]$, and $D_0$ is not dense
in the set of continuous functions from $X$ to $[-1, 1]$. As an example, consider the
case where $K = \{1, 2\}$ and $f$ in $E$ is piecewise-linear with $f(1,0) = f(0,1) = 0$
and $f(1/2, 1/2) = 1$. If a function $g$ in $D_0$ is such that $g(1/2, 1/2) = 1$, then
necessarily the values of the two matrix games $G^1$ and $G^2$ are also equal to 1
since it is the maximum value. Therefore $f$ is not in $D_0$. In fact $f$ is 1-Lipschitz,
however $2f(1/2, 1/2) - f(1, 0) = 2 > \|2(1/2, 1/2) - (1, 0)\| = 1$, so it is not in $D_1$,
which we will see later contains $D_0$ (see lemma 2.13).

Lemma 2.12. If $f$, $g$ belong to $D_0$ and $\lambda \in [0, 1]$, then $-f$, $\sup\{f, g\}$, $\inf\{f, g\}$
and $\lambda f + (1 - \lambda) g$ are in $D_0$. The linear span of $D_0$ is dense in $C(X)$.

Proof: The proof can be easily deduced from proposition 5.1, page 357 in MSZ,
part B. For instance, let $f$ and $g$ in $D_0$ be respectively defined by the collections
of matrices $(G^k)_{k \in K}$ with size $I_1 \times J_1$ and $(H^k)_{k \in K}$ with size $I_2 \times J_2$.

Defining for each $k$, $i_1$, $j_1$: $G'^k(j_1, i_1) = -G^k(i_1, j_1)$ yields a family of matrices
$(G'^k)_k$ with size $J_1 \times I_1$ inducing $-f$. So $-f \in D_0$.
To get that $\sup\{f, g\}$ belongs to $D_0$, one can assume w.l.o.g. that $I_1 \cap I_2 =
J_1 \cap J_2 = \emptyset$. Set $I = I_1 \cup I_2$ and $J = J_1 \times J_2$. Define for each $k$ the matrix
game $L^k$ in $\mathbb{R}^{I \times J}$ by $L^k(i, (j_1, j_2)) = G^k(i, j_1)$ if $i \in I_1$, $L^k(i, (j_1, j_2)) = H^k(i, j_2)$
if $i \in I_2$. Then for each $p$ in $X$, we have $\mathrm{Val}(\sum_k p^k L^k) = \sup\{f(p), g(p)\}$, so that
$\sup\{f, g\} \in D_0$.

Lemma 2.13. The closure of $D_0$ is $D_1$.

Proof: We first show that $D_0 \subset D_1$. Let $I$ and $J$ be finite sets, and $(G^k)_{k \in K}$ be
a collection of $I \times J$-matrices with values in $[-1, 1]$. Consider $p$ and $q$ in $X$ and
$a$ and $b$ non negative. Then for all $i$ and $j$:

$$\left| \sum_k p^k a G^k(i,j) - \sum_k q^k b G^k(i,j) \right| \leq \sum_k |a p^k - b q^k| = \|ap - bq\|.$$

As a consequence,

$$a \mathrm{Val}\left(\sum_k p^k G^k\right) - b \mathrm{Val}\left(\sum_k q^k G^k\right) = \mathrm{Val}\left(\sum_k a p^k G^k\right) - \mathrm{Val}\left(\sum_k b q^k G^k\right) \leq \|ap - bq\|.$$

We now show that the closure of $D_0$ is $D_1$. Consider $f$ in $D_1$, in particular we
have $\|f\|_\infty \leq 1$. Let $p$ and $q$ be distinct elements in $X$, and define $Y$ as the linear
span of $p$ and $q$, and define $\varphi$ from $Y$ to $\mathbb{R}$ such that: $\varphi(\lambda p + \mu q) = \lambda f(p) + \mu f(q)$
for all reals $\lambda$ and $\mu$.

If $\lambda \geq 0$ and $\mu \geq 0$, we have $\varphi(\lambda p + \mu q) \leq \lambda + \mu = \|\lambda p + \mu q\|$. If $\lambda \geq 0$
and $\mu \leq 0$, we directly use the definition of $D_1$ to get: $\varphi(\lambda p + \mu q) \leq \|\lambda p + \mu q\|$.
As a consequence, $\varphi$ is a linear form with norm at most 1 on $Y$. By Hahn-Banach
theorem, it can be extended to a linear mapping on $\mathbb{R}^K$ with the same
norm, and we denote by $g$ the restriction of this mapping to $X$. $g$ is affine with
$g(p) = \varphi(p) = f(p)$ and $g(q) = \varphi(q) = f(q)$. Moreover, for each $r$ in $X$, we have
$|g(r)| \leq \|r\| = 1$. As a consequence $g$ belongs to $D_0$.

Because $D_0$ is stable under the sup and inf operations, we can use Stone-Weierstrass
theorem (see for instance lemma A7.2 in Ash p. 392) to conclude that
$f$ belongs to the closure of $D_0$. □

Definition 2.14. Given $u$ and $v$ in $\Delta(X)$, define:

$$d_0(u,v) = \sup_{f \in D_0} u(f) - v(f).$$

Proposition 2.15. $d_0$ is a distance on $\Delta(X)$ metrizing the weak-* topology.
Moreover $d_0 = d_1 = d_2 = d_3$.

Proof: $d_0 = d_1 = d_2 = d_3$ follows from lemma 2.13 and theorem 2.7. Because
the linear span of $D_0$ is dense in $C(X)$, we obtain the separation property and
$d_0$ is a distance on $\Delta(X)$. Because $D_0 \subset D_1 \subset E_1$, we have $d_0 = d_1 \leq d_{KR}$.
Since $(\Delta(X), d_{KR})$ is a compact metric space, the identity map from $(\Delta(X), d_{KR})$ to
$(\Delta(X), d_0)$ is bicontinuous, and we obtain that $(\Delta(X), d_0)$ is a compact metric
space and $d_0$ and $d_{KR}$ are equivalent (see for instance proposition 2 page 138
Aubin). □

Remark: one can show that allowing for infinite sets $I$, $J$ in the definition of
$D_0$ (still assuming that all games $\sum_k p^k G^k$ have a value) would not change the
value of $d_0$.

From now on, we just write $d_*(u,v)$ for the distance $d_0 = d_1 = d_2 = d_3$ on
$\Delta(X)$. Elements of $X$ can be viewed as elements of $\Delta(X)$ (using Dirac measures),
and it is well known that for $p$, $q$ in $X$, we have: $d_{KR}(\delta_p, \delta_q) = \|p - q\|$. We have
the same result with $d_*$.

Lemma 2.16. For $p$, $q$ in $X$, we have $d_*(\delta_p, \delta_q) = \|p - q\|$.

Proof: Define $K_1 = \{k \in K, \ p^k > q^k\}$, and $K_2 = K \setminus K_1$. Consider $f$ affine on
$X$ such that $f(k) = +1$ if $k \in K_1$, and $f(k) = -1$ if $k \in K_2$. Then $f \in D_0$ and
$d_*(\delta_p, \delta_q) \geq |f(p) - f(q)| = \|p - q\|$. The other inequality is clear. □
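
The witness used in the proof of lemma 2.16 is easy to compute. The following sketch builds the affine function with coefficients $\pm 1$ and checks that it recovers the $\ell_1$ distance between two beliefs. The function name `dirac_witness` and the sample vectors are illustrative assumptions.

```python
from typing import List, Tuple


def dirac_witness(p: List[float], q: List[float]) -> Tuple[List[float], float]:
    """Return the coefficients f(k) in {+1, -1} from the proof of lemma 2.16
    (f(k) = +1 when p^k > q^k, and -1 otherwise) and the gap f(p) - f(q),
    where f is extended affinely: f(p) = sum_k p^k f(k)."""
    coeffs = [1.0 if pk > qk else -1.0 for pk, qk in zip(p, q)]
    gap = sum(fk * (pk - qk) for fk, pk, qk in zip(coeffs, p, q))
    return coeffs, gap


def l1(p: List[float], q: List[float]) -> float:
    """||p - q||_1 on R^K."""
    return sum(abs(pk - qk) for pk, qk in zip(p, q))


p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
coeffs, gap = dirac_witness(p, q)
```

Since every coefficient lies in $[-1, 1]$, the affine function is an element of $D_0$, and `gap` coincides with `l1(p, q)`.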

We now present a dual formulation for our distance, in the spirit of Kantorovich
duality formula from optimal transport. For any $u$, $v$ in $\Delta(X)$, we denote by
$\Pi(u, v)$ the set of transference plans, or couplings, of $u$ and $v$, that is the set of
probability distributions over $X \times X$ with first marginal $u$ and second marginal
$v$. Recall (see for instance Villani 2003, p. 207):

$$d_{KR}(u,v) = \sup_{f \in E_1} |u(f) - v(f)| = \min_{\gamma \in \Pi(u,v)} \int_{X \times X} \|x - y\| \, d\gamma(x, y).$$

We will concentrate on probabilities on $X$ with finite support. We denote by
$Z = \Delta_f(X)$ the set of such probabilities.

Definition 2.17. Let $u$ and $v$ be in $Z$ with respective supports $U$ and $V$. We
define $\mathcal{M}_4(u,v)$ as the set

$$\left\{ (\alpha, \beta) \in (\mathbb{R}_+^{U \times V})^2, \ \text{s.t. } \forall x \in U, \ \forall y \in V, \ \sum_{y' \in V} \alpha(x, y') = u(x) \ \text{and} \ \sum_{x' \in U} \beta(x', y) = v(y) \right\}.$$

And $d_4(u,v) = \inf_{(\alpha, \beta) \in \mathcal{M}_4(u,v)} \sum_{(x,y) \in U \times V} \|x \alpha(x,y) - y \beta(x,y)\|$.

Notice that diagonal elements in $\mathcal{M}_4(u, v)$, i.e. measures $\alpha$ such that $(\alpha, \alpha) \in
\mathcal{M}_4(u,v)$, coincide with elements of $\Pi(u,v)$. $\mathcal{M}_4(u,v)$ is a polytope in the Euclidean
space $(\mathbb{R}^{U \times V})^2$, so the infimum in the definition of $d_4(u,v)$ is achieved.

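Theorem 2.18. (Duality formula) Let $u$ and $v$ be in $Z$ with respective supports
$U$ and $V$.

$$d_*(u,v) = \sup_{f \in D_1} |u(f) - v(f)| = \min_{(\alpha, \beta) \in \mathcal{M}_4(u,v)} \sum_{(x,y) \in U \times V} \|x \alpha(x,y) - y \beta(x,y)\|,$$

where $D_1 = \{f \in E, \ \forall x, y \in X, \ \forall a, b \geq 0, \ a f(x) - b f(y) \leq \|ax - by\|\}$,
and $\mathcal{M}_4(u,v) = \{(\alpha, \beta) \in \mathbb{R}_+^{U \times V} \times \mathbb{R}_+^{U \times V}, \ \text{s.t. } \forall (x,y) \in U \times V, \ \sum_{y' \in V} \alpha(x, y') = u(x) \ \text{and} \ \sum_{x' \in U} \beta(x', y) = v(y)\}$.

The proof is postponed to the next subsection. We conclude this part by a
simple but fundamental property of the distance $d_*$.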

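Definition 2.19. Given a finite set $S$, we define the posterior mapping $\psi_S$ from
$\Delta(K \times S)$ to $\Delta(X)$ by:

$$\psi_S(\pi) = \sum_{s \in S} \pi(s) \, \delta_{p(s)},$$

where for each $s$, $\pi(s) = \sum_{k \in K} \pi(k, s)$ and $p(s) = (p^k(s))_{k \in K} \in X$ is the posterior
on $K$ given $s$ (defined arbitrarily if $\pi(s) = 0$): for each $k$ in $K$, $p^k(s) = \frac{\pi(k,s)}{\pi(s)}$.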
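
The posterior mapping of definition 2.19 can be sketched in a few lines. In the snippet below a law on $K \times S$ is stored as a dictionary, and the result is a dictionary from posteriors (tuples indexed by $K$) to weights. The data layout, the function name `posterior_mapping` and the sample laws are assumptions of this illustration, and signals with zero probability are simply dropped since they carry no weight.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def posterior_mapping(
    pi: Dict[Tuple[Hashable, Hashable], float], states: Sequence[Hashable]
) -> Dict[Tuple[float, ...], float]:
    """Disintegrate a law pi on K x S into a finitely supported law on Delta(K).

    pi maps (k, s) to its probability; `states` fixes the ordering of K.
    The result maps each posterior p(s) to the total mass pi(s) of the
    signals inducing it."""
    signal_mass: Dict[Hashable, float] = defaultdict(float)
    for (_, s), mass in pi.items():
        signal_mass[s] += mass

    law: Dict[Tuple[float, ...], float] = defaultdict(float)
    for s, total in signal_mass.items():
        if total > 0:  # zero-mass signals contribute nothing to psi_S(pi)
            posterior = tuple(pi.get((k, s), 0.0) / total for k in states)
            law[posterior] += total  # exact float keys: fine for a toy example
    return dict(law)


states = ("a", "b")
revealing = {("a", "s1"): 0.5, ("b", "s2"): 0.5}
uninformative = {
    ("a", "s1"): 0.25, ("b", "s1"): 0.25,
    ("a", "s2"): 0.25, ("b", "s2"): 0.25,
}
law_revealing = posterior_mapping(revealing, states)
law_uninformative = posterior_mapping(uninformative, states)
```

In the first law the signal reveals the state, so the posteriors are the two extreme points with weight $1/2$ each; in the second the signal is independent of the state and the disintegration is the Dirac mass at the prior $(1/2, 1/2)$.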

$\psi_S(\pi)$ is a probability with finite support over $X$. Intuitively, think of a joint
variable $(k, s)$ being selected according to $\pi$, and an agent who just observes $s$. His
knowledge on $K$ is then represented by $p(s)$, and $\psi_S(\pi)$ represents the ex-ante
information that the agent will know about the variable $k$. $\Delta(K \times S)$ is endowed
as usual with the $\|\cdot\|_1$ norm. One can show that $\psi_S$ is continuous whenever $\Delta(X)$ is
endowed with the weak-* topology. Intuitively, $\psi_S(\pi)$ has less information than
$\pi$, because the agent does not care about $s$ itself but only about the information
on $k$ given by $s$. So one may hope that the mapping $\psi_S$ is 1-Lipschitz (non
expansive) for a well chosen distance on $\Delta(X)$. This is not the case if one uses
the Kantorovich-Rubinstein distance $d_{KR}$, as shown by the example below:

Example 2.20. Consider the case where $K = \{a, b, c\}$ and $S$ has two elements. We
denote by $\pi$ and $\pi'$ two laws in $\Delta(K \times S)$, each given by a table over $K \times S$
(the entries of the two tables, and the numerical values derived from them, are not legible in this copy).

Their disintegrations $\psi_S(\pi)$ and $\psi_S(\pi')$ are each supported on at most two posteriors in $\Delta(K)$.
We define a test function $f : \Delta(K) \to [-1, 1]$ through its values at these posteriors.

Since $f$ is 1-Lipschitz, $d_{KR}(\psi_S(\pi), \psi_S(\pi')) \geq \psi_S(\pi')(f) - \psi_S(\pi)(f) > \|\pi - \pi'\|_1$.
The posterior mapping $\psi_S$ is not 1-Lipschitz from
$(\Delta(K \times S), \|\cdot\|_1)$ to $(\Delta(X), d_{KR})$.

However, the next proposition shows that the distance $d_*$ has the desirable
property.

Proposition 2.21. For each finite set $S$, the mapping $\psi_S$ is 1-Lipschitz from
$(\Delta(K \times S), \|\cdot\|_1)$ to $(\Delta_f(X), d_*)$.

Moreover, $d_*$ is the largest distance on $Z$ having this property: given $u$ and $v$
in $Z$, we have

$$d_*(u,v) = \inf\{\|\pi - \pi'\|_1, \ \text{s.t. } S \text{ finite}, \ \psi_S(\pi) = u, \ \psi_S(\pi') = v\}.$$

Proof: First fix $S$ and $\pi$, $\pi'$ in $\Delta(K \times S)$. Write $u = \psi_S(\pi)$, $u' = \psi_S(\pi')$. For
any $f$ in $D_1$, we have:

$$u(f) - u'(f) = \sum_{s \in S} \pi(s) f(p(s)) - \pi'(s) f(p'(s))$$
$$\leq \sum_{s \in S} \|\pi(s) p(s) - \pi'(s) p'(s)\|$$
$$\leq \sum_{s \in S} \|(\pi(k,s))_k - (\pi'(k,s))_k\|$$
$$= \sum_{k \in K, s \in S} |\pi(k, s) - \pi'(k, s)| = \|\pi - \pi'\|_1.$$

So $d_*(u, u') \leq \|\pi - \pi'\|_1$, and $\psi_S$ is 1-Lipschitz.

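Let now $u$ and $v$ be in $Z$. There exists $(\alpha, \beta) \in \mathcal{M}_4(u, v)$ such that

$$d_*(u,v) = \sum_{(x,y) \in U \times V} \|\alpha(x,y) x - \beta(x,y) y\|.$$

Define $S = U \times V$ and $\pi, \pi' \in \Delta(K \times S)$ by $\pi(k, (x,y)) = x(k) \alpha(x,y)$ and
$\pi'(k, (x,y)) = y(k) \beta(x,y)$. By definition of $\mathcal{M}_4(u,v)$, $\pi$ and $\pi'$ are probabilities,
$\psi_S(\pi) = u$, $\psi_S(\pi') = v$, and

$$\|\pi - \pi'\|_1 = \sum_{k \in K, (x,y) \in U \times V} |x(k) \alpha(x,y) - y(k) \beta(x,y)| = \sum_{(x,y) \in U \times V} \|x \alpha(x,y) - y \beta(x,y)\| = d_*(u,v).$$

□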
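
The construction used at the end of this proof can be illustrated numerically. The sketch below takes two finitely supported laws $u$ and $v$ on a two-point simplex, uses the product coupling as a (diagonal, generally non-optimal) element of $\mathcal{M}_4(u,v)$, builds the laws $\pi$ and $\pi'$ on $K \times (U \times V)$, and compares $\|\pi - \pi'\|_1$ with the objective of definition 2.17. The choice of $u$, $v$ and of the product coupling is an assumption of this example, so the number obtained is only an upper bound on $d_*(u,v)$ in general.

```python
from typing import Dict, Tuple

Point = Tuple[float, ...]                 # an element of X = Delta(K)
Plan = Dict[Tuple[Point, Point], float]   # a function on U x V


def l1(p, q) -> float:
    """||p - q||_1 on R^K."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cost(alpha: Plan, beta: Plan) -> float:
    """Objective of d_4: sum over (x, y) of ||x*alpha(x,y) - y*beta(x,y)||_1."""
    return sum(
        l1([xk * alpha[(x, y)] for xk in x], [yk * beta[(x, y)] for yk in y])
        for (x, y) in alpha
    )


def lift(alpha: Plan, beta: Plan):
    """Laws on K x S with S = U x V, as in the proof:
    pi(k,(x,y)) = x(k)*alpha(x,y) and pi'(k,(x,y)) = y(k)*beta(x,y)."""
    pi, pi_prime = {}, {}
    for (x, y), mass in alpha.items():
        for k in range(len(x)):
            pi[(k, (x, y))] = x[k] * mass
            pi_prime[(k, (x, y))] = y[k] * beta[(x, y)]
    return pi, pi_prime


u = {(1.0, 0.0): 0.5, (0.0, 1.0): 0.5}
v = {(0.5, 0.5): 1.0}
alpha = {(x, y): u[x] * v[y] for x in u for y in v}  # product coupling
beta = dict(alpha)                                   # diagonal element (alpha, alpha)

pi, pi_prime = lift(alpha, beta)
distance_between_laws = sum(abs(pi[key] - pi_prime[key]) for key in pi)
upper_bound = cost(alpha, beta)
```

By construction `distance_between_laws` equals `upper_bound`: the $\ell_1$ distance between the lifted laws is exactly the value of the objective at $(\alpha, \beta)$, which is the identity the proof relies on.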

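2.4 Proof of the duality formula

Let $u$ and $v$ be in $\Delta(X)$, and denote by $U$ and $V$ the respective supports of $u$
and $v$. We write $S = X^2 \times [0, 1]^2$, and we start with a lemma, where no finiteness
assumption on $U$ or $V$ is needed.

Lemma 2.22. For each $\gamma \in \mathcal{M}_3(u,v)$, we have:

$$\int_{X^2 \times [0,1]^2} \|\lambda x - \mu y\| \, d\gamma(x, y, \lambda, \mu) = 2 + \int_{U \times V \times [0,1]^2} \left(\|\lambda x - \mu y\| - \lambda - \mu\right) d\gamma(x, y, \lambda, \mu).$$

Proof: Write $A(\gamma) = \int_S \|\lambda x - \mu y\| \, d\gamma(x, y, \lambda, \mu)$. By definition of $\mathcal{M}_3(u,v)$, we
have:

$$\int_S \lambda 1_{x \notin U} \, d\gamma = 0, \quad \text{and} \quad \int_S \mu 1_{y \notin V} \, d\gamma = 0.$$

So that $\lambda 1_{x \notin U} = \mu 1_{y \notin V} = 0$ $\gamma$-a.e. We can write:

$$A(\gamma) = \int_S 1_{x \in U, y \in V} \|\lambda x - \mu y\| \, d\gamma + \int_S 1_{x \in U, y \notin V} \|\lambda x - \mu y\| \, d\gamma + \int_S 1_{x \notin U, y \in V} \|\lambda x - \mu y\| \, d\gamma + \int_S 1_{x \notin U, y \notin V} \|\lambda x - \mu y\| \, d\gamma$$
$$= \int_S 1_{x \in U, y \in V} \|\lambda x - \mu y\| \, d\gamma + \int_S 1_{x \in U, y \notin V} \lambda \, d\gamma + \int_S 1_{x \notin U, y \in V} \mu \, d\gamma.$$

We also have by definition of $\mathcal{M}_3(u, v)$ that $1 = \int_S 1_{x \in U} \lambda \, d\gamma$, so that:

$$1 = \int_S 1_{x \in U, y \in V} \lambda \, d\gamma + \int_S 1_{x \in U, y \notin V} \lambda \, d\gamma.$$

And similarly $1 = \int_S 1_{x \in U, y \in V} \mu \, d\gamma + \int_S 1_{x \notin U, y \in V} \mu \, d\gamma$. We obtain:

$$A(\gamma) = 2 + \int_S 1_{x \in U, y \in V} \|\lambda x - \mu y\| \, d\gamma(x, y, \lambda, \mu) - \int_S 1_{x \in U, y \in V} \lambda \, d\gamma - \int_S 1_{x \in U, y \in V} \mu \, d\gamma.$$

□

We assume in the sequel that $U$ and $V$ are finite, and define $\mathcal{M}_5(u, v)$ as follows:

Definition 2.23. Define

$$\mathcal{M}_5(u,v) = \left\{ (\alpha, \beta) \in (\mathbb{R}^{U \times V})^2, \ \text{s.t. } \forall x \in U, \ \forall y \in V, \ \alpha(x,y) \geq 0, \ \beta(x,y) \geq 0, \ \sum_{y' \in V} \alpha(x, y') \leq u(x) \ \text{and} \ \sum_{x' \in U} \beta(x', y) \leq v(y) \right\}.$$

And $d_5(u,v) = \inf_{(\alpha, \beta) \in \mathcal{M}_5(u,v)} 2 + \sum_{(x,y) \in U \times V} \left(\|x \alpha(x,y) - y \beta(x,y)\| - \alpha(x,y) - \beta(x,y)\right)$.

$\mathcal{M}_5(u,v)$ is a polytope in the Euclidean space $(\mathbb{R}^{U \times V})^2$, so the infimum in
the definition of $d_5(u,v)$ is achieved.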

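Lemma 2.24. $d_3(u,v) \geq d_5(u,v)$.

Proof: Let $\gamma$ be in $\mathcal{M}_3(u,v)$. Fix for a while $(x, y)$ in $U \times V$, and assume that
$\gamma(x, y) > 0$. We define $\gamma(\cdot | x, y)$ the conditional probability on $[0, 1]^2$ given $(x, y)$
by: for all $\varphi \in C([0, 1]^2)$,

$$\int_{[0,1]^2} \varphi(\lambda, \mu) \, d\gamma(\lambda, \mu | x, y) = \frac{1}{\gamma(x,y)} \int_{(x', y', \lambda, \mu) \in S} 1_{x' = x, y' = y} \, \varphi(\lambda, \mu) \, d\gamma(x', y', \lambda, \mu).$$

So that

$$\gamma(x,y) \int_{[0,1]^2} \left(\|\lambda x - \mu y\| - \lambda - \mu\right) d\gamma(\lambda, \mu | x, y) = \int_{(\lambda, \mu) \in [0,1]^2} \left(\|\lambda x - \mu y\| - \lambda - \mu\right) d\gamma(x, y, \lambda, \mu).$$

The mapping $\Psi : (\lambda, \mu) \mapsto \|\lambda x - \mu y\| - \lambda - \mu$ is convex so by Jensen's inequality
we get:

$$\int_{(\lambda, \mu) \in [0,1]^2} \left(\|\lambda x - \mu y\| - \lambda - \mu\right) d\gamma(\lambda, \mu | x, y) \geq \left\| x \int_{[0,1]^2} \lambda \, d\gamma(\lambda, \mu | x, y) - y \int_{[0,1]^2} \mu \, d\gamma(\lambda, \mu | x, y) \right\| - \int_{[0,1]^2} \lambda \, d\gamma(\lambda, \mu | x, y) - \int_{[0,1]^2} \mu \, d\gamma(\lambda, \mu | x, y).$$

We write:

$$P(x,y) = \int_{(\lambda, \mu) \in [0,1]^2} \lambda \, d\gamma(\lambda, \mu | x, y) \quad \text{and} \quad Q(x,y) = \int_{(\lambda, \mu) \in [0,1]^2} \mu \, d\gamma(\lambda, \mu | x, y),$$

so that

$$\int_{(\lambda, \mu) \in [0,1]^2} \left(\|\lambda x - \mu y\| - \lambda - \mu\right) d\gamma(\lambda, \mu | x, y) \geq \|x P(x,y) - y Q(x,y)\| - P(x,y) - Q(x,y).$$

Now, by lemma 2.22,

$$A(\gamma) = 2 + \int_{U \times V \times [0,1]^2} \left(\|\lambda x - \mu y\| - \lambda - \mu\right) d\gamma(x, y, \lambda, \mu)$$
$$= 2 + \sum_{x \in U, y \in V, \gamma(x,y) > 0} \int_{[0,1]^2} \left(\|\lambda x - \mu y\| - \lambda - \mu\right) d\gamma(x, y, \lambda, \mu)$$
$$\geq 2 + \sum_{x \in U, y \in V, \gamma(x,y) > 0} \gamma(x,y) \left(\|x P(x,y) - y Q(x,y)\| - P(x,y) - Q(x,y)\right).$$

For $(x, y)$ in $U \times V$, define $\alpha(x,y) = \gamma(x,y) P(x,y) \geq 0$ and $\beta(x,y) =
\gamma(x,y) Q(x,y) \geq 0$ (with $\alpha(x,y) = \beta(x,y) = 0$ if $\gamma(x,y) = 0$). We get:

$$A(\gamma) \geq 2 + \sum_{(x,y) \in U \times V} \left(\|x \alpha(x,y) - y \beta(x,y)\| - \alpha(x,y) - \beta(x,y)\right).$$

And we have, for each $x$ in $U$:

$$\sum_{y \in V} \alpha(x,y) = \sum_{y \in V} \int_{(\lambda, \mu) \in [0,1]^2} \lambda \, d\gamma(x, y, \lambda, \mu) \leq \int_{(y, \lambda, \mu) \in X \times [0,1]^2} \lambda \, d\gamma(x, y, \lambda, \mu) = u(x),$$

where the last equality comes from the definition of $\mathcal{M}_3(u,v)$. Similarly, for each
$y$ in $V$ we can show that $\sum_{x \in U} \beta(x,y) \leq v(y)$, and lemma 2.24 is proved. □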

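Lemma 2.25. $d_5(u,v) \ge d_4(u,v)$.

Proof: Consider $(\alpha^*,\beta^*)$ achieving the minimum in the definition of $d_5(u,v)$.

Assume that there exists $x^*$ such that $\sum_{y\in V}\alpha^*(x^*,y) < u(x^*)$. For any $x$ in $X$ and $z$ in $\mathbb{R}^K$, one can check that the mapping $l : (\alpha\mapsto\|\alpha x-z\|-\alpha)$ is nonincreasing from $\mathbb{R}_+$ to $\mathbb{R}$ (as the sum of the mappings $l^k : (\alpha\mapsto|\alpha x^k-z^k|-\alpha x^k)$, each $l^k$ being non increasing in $\alpha$). As a consequence, one can choose any $y^*$ in $V$ and increase $\alpha^*(x^*,y^*)$ in order to saturate the constraint without increasing the objective. So we can assume without loss of generality that $\sum_{y\in V}\alpha^*(x^*,y)=u(x^*)$ for all $x^*$ and similarly $\sum_{x\in U}\beta^*(x,y^*)=v(y^*)$ for all $y^*$. The weights $\alpha^*$ and $\beta^*$ then each sum to 1, so the constant 2 cancels out. Consequently,

$$d_5(u,v) = 2+\sum_{(x,y)\in U\times V}\big(\|x\alpha^*(x,y)-y\beta^*(x,y)\|-\alpha^*(x,y)-\beta^*(x,y)\big) = \sum_{(x,y)\in U\times V}\|x\alpha^*(x,y)-y\beta^*(x,y)\| \ \ge\ d_4(u,v).$$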
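The monotonicity used in this proof can be checked numerically. The following is a minimal Python sketch, not part of the original argument: the function name `shifted_norm` and the sample vectors are illustrative assumptions, and the norm is taken to be the $\ell_1$ norm with $x$ a point of the simplex.

```python
def shifted_norm(alpha, x, z):
    """l(alpha) = ||alpha * x - z||_1 - alpha, for x in the simplex.

    Since the coordinates of x sum to 1, this equals the sum over k of
    |alpha * x_k - z_k| - alpha * x_k, each term being nonincreasing in alpha.
    """
    return sum(abs(alpha * xk - zk) for xk, zk in zip(x, z)) - alpha


def is_nonincreasing(values, tol=1e-12):
    """True when each value is at most the previous one (up to rounding)."""
    return all(b <= a + tol for a, b in zip(values, values[1:]))
```

Evaluating `shifted_norm` on an increasing grid of `alpha` and passing the results to `is_nonincreasing` illustrates why a constraint can be saturated without increasing the objective.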

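Lemma 2.26. $d_4(u,v) \ge d_2(u,v)$.

Proof: Fix $(f,g)\in D_2$ and $(\alpha,\beta)\in\mathcal{A}_4(u,v)$.

$$u(f)+v(g) = \sum_{x\in U}f(x)u(x)+\sum_{y\in V}g(y)v(y) = \sum_{(x,y)\in U\times V}\big(f(x)\alpha(x,y)+g(y)\beta(x,y)\big) \ \le\ \sum_{(x,y)\in U\times V}\|\alpha(x,y)x-\beta(x,y)y\|.$$

Taking the supremum over $(f,g)\in D_2$ and the infimum over $(\alpha,\beta)\in\mathcal{A}_4(u,v)$ gives $d_2(u,v)\le d_4(u,v)$.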

We have shown that d^(u,v) > d^iu^v) > d4{u,v) > d2{u,v) = d^iu^v) = 
$d_1(u,v)$. This ends the proof of theorem 2.18.

3 Long-term values for compact non expansive 
Markov Decision Processes 

In this section we consider Markov Decision Processes, or Controlled Markov Chains, with bounded payoffs and transitions with finite support. We will consider two closely related models of MDP and prove in each case the existence and a characterization for a general notion of long-term value. The first model deals with MDP without any explicit action set (hence, payoffs only depend on the current state). Such MDP will be called gambling houses, using the terminology of gambling theory (see Maitra and Sudderth 1996). We will assume in this setup that the set of states $X$ is metric compact and that the transitions are non expansive with respect to the KR-distance on $\Delta(X)$. Since we only use the KR-distance here, the theorem for the first model, namely theorem 3.9, does not use the distance for belief spaces studied in section 2. The second model is the standard model of Markov Decision Processes with states, actions, transitions and payoffs, and we will assume that the state space $X$ is a compact subset of a simplex $\Delta(K)$. We will need for this second case an assumption of non expansiveness for the transitions which is closely related to the distance $d_*$ introduced in section 2, see theorem 3.19 later. The applications in sections 4.1 and 4.2 will be based on the second model.

3.1 Long-term values for Gambling Houses 

In this section we consider Markov Decision Processes of the following form. There is a non empty set of states $X$, a transition given by a multi-valued mapping $F : X \rightrightarrows \Delta_f(X)$ with non empty values, and a payoff (or reward) function $r : X \to [0,1]$. The idea is that given an initial state $x_0$ in $X$, a decision-maker (or player) can choose a probability with finite support $u_1$ in $F(x_0)$, then $x_1$ is selected according to $u_1$ and there is a payoff $r(x_1)$. Then the player has to select $u_2$ in $F(x_1)$, $x_2$ is selected according to $u_2$ and the player receives the payoff $r(x_2)$, etc. Note that there is no explicit action set here, and that the transitions take values in $\Delta_f(X)$ and hence all have finite support.

We say that $\Gamma = (X,F,r)$ is a Gambling House. We assimilate the elements in $X$ with their Dirac measures in $\Delta(X)$, and in case the values of $F$ only consist of Dirac measures on $X$, we view $F$ as a correspondence from $X$ to $X$ and say that $\Gamma$ is a deterministic Gambling House (or a Dynamic Programming problem). In general we write $Z = \Delta_f(X)$, and an element in $Z$ is written $u = \sum_{x\in X}u(x)\delta_x$. The set of stages is $\mathbb{N}^* = \{1, \dots\}$, and a probability distribution over stages is called an evaluation. Given an evaluation $\theta = (\theta_t)_{t\ge 1}$ and an initial state $x_0$ in $X$, the $\theta$-problem $\Gamma_\theta(x_0)$ is the problem induced by a decision-maker starting from $x_0$ and maximizing the expectation of $\sum_{t\ge 1}\theta_t r(x_t)$.

Formally, we first linearly extend $r$ and $F$ to $\Delta_f(X)$ by defining for each $u = \sum_{x\in X}u(x)\delta_x$ in $Z$ the payoff $r(u) = \sum_{x\in X}r(x)u(x)$ and the transition

$$F(u) = \Big\{\sum_{x\in X}u(x)f(x),\ \text{s.t. } f : X\to Z \text{ and } f(x)\in F(x)\ \forall x\in X\Big\}.$$

We also define the mixed extension of $F$ as the correspondence $\hat F$ from $Z$ to itself which associates to every $u = \sum_{x\in X}u(x)\delta_x$ in $\Delta_f(X)$ the image:

$$\hat F(u) = \Big\{\sum_{x\in X}u(x)f(x),\ \text{s.t. } f : X\to Z \text{ and } f(x)\in \mathrm{conv}F(x)\ \forall x\in X\Big\}.$$

The graph of $\hat F$ is the convex hull of the graph of $F$. Moreover $\hat F$ is an affine correspondence, as shown by the lemma below.

Lemma 3.1. $\forall u,u'\in Z$, $\forall\alpha\in[0,1]$, $\hat F(\alpha u+(1-\alpha)u') = \alpha\hat F(u)+(1-\alpha)\hat F(u')$.

Proof: The $\subset$ part is clear. To see the reverse inclusion, let $v = \alpha\sum_{x\in X}u(x)f(x)+(1-\alpha)\sum_{x\in X}u'(x)f'(x)$ be in $\alpha\hat F(u)+(1-\alpha)\hat F(u')$, with transparent notations. Define

$$g(x) = \frac{\alpha u(x)f(x)+(1-\alpha)u'(x)f'(x)}{\alpha u(x)+(1-\alpha)u'(x)}$$

for each $x$ such that the denominator is positive. Then $g(x)\in\mathrm{conv}F(x)$, and

$$v = \sum_{x\in X}\big(\alpha u(x)+(1-\alpha)u'(x)\big)g(x)\ \in\ \hat F(\alpha u+(1-\alpha)u').$$

Definition 3.2. A pure play, or deterministic play, at $x_0$ is a sequence $\sigma = (u_1,\dots,u_t,\dots)\in Z^\infty$ such that $u_1\in F(x_0)$ and $u_{t+1}\in F(u_t)$ for each $t\ge 1$. A play, or mixed play, at $x_0$ is a sequence $\sigma = (u_1,\dots,u_t,\dots)\in Z^\infty$ such that $u_1\in\mathrm{conv}F(x_0)$ and $u_{t+1}\in\hat F(u_t)$ for each $t\ge 1$. We denote by $\Sigma(x_0)$ the set of mixed plays at $x_0$.

A pure play is a particular case of a mixed play. Mixed plays correspond to situations where the decision-maker can select, at every stage $t$ and state $x_{t-1}$, randomly the law $u_t$ of the new state. A mixed play at $x_0$ naturally induces a probability distribution over the set $(X\times\Delta_f(X))^\infty$ of sequences $(x_0,u_1,x_1,u_2,\dots)$, where $X$ and $Z$ are endowed with the discrete $\sigma$-algebra and $(X\times\Delta_f(X))^\infty$ is endowed with the product $\sigma$-algebra.

Definition 3.3. Given an evaluation $\theta$, the $\theta$-payoff of a play $\sigma = (u_1,\dots,u_t,\dots)$ is defined as: $\gamma_\theta(\sigma) = \sum_{t\ge 1}\theta_t r(u_t)$, and the $\theta$-value at $x_0$ is:

$$v_\theta(x_0) = \sup_{\sigma\in\Sigma(x_0)}\gamma_\theta(\sigma).$$

It is easy to see that the supremum in the definition of $v_\theta$ can be taken over the set of pure plays at $x_0$. We have the following recursive formula. For each evaluation $\theta = (\theta_t)_{t\ge 1}$ such that $\theta_1<1$, we denote by $\theta^+$ the "shifted" evaluation $\big(\frac{\theta_{t+1}}{1-\theta_1}\big)_{t\ge 1}$. We extend linearly $v_\theta$ to $Z$, so that the recursive formula can be written:

$$\forall\theta\in\Delta(\mathbb{N}^*),\ \forall x\in X,\quad v_\theta(x) = \sup_{u\in\mathrm{conv}F(x)}\big(\theta_1 r(u)+(1-\theta_1)v_{\theta^+}(u)\big).$$

And by linearity the supremum can be taken over $F(x)$. It is also easy to see that for all evaluation $\theta$ and initial state $x$, we have the inequality:

$$\Big|v_\theta(x)-\sup_{u\in F(x)}v_\theta(u)\Big| \ \le\ \theta_1+\sum_{t\ge 2}|\theta_t-\theta_{t-1}|. \qquad (1)$$
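For a finite deterministic Gambling House and an evaluation with finite support, the recursive formula reduces to a backward induction. The following Python sketch is an illustration only: the names `theta_values` and `inequality_one_bound`, and the dictionary encoding of $F$ and $r$, are assumptions made for this example and do not come from the original text.

```python
def theta_values(F, r, theta):
    """Return {x: v_theta(x)} for a finite deterministic gambling house.

    F maps each state to the list of states reachable in one step,
    r maps each state to its payoff in [0, 1], and theta is the finite
    list (theta_1, ..., theta_T) of stage weights.
    """
    # w[x] is the best weighted payoff over the remaining stages from x.
    w = {x: 0.0 for x in F}
    for weight in reversed(theta):
        w = {x: max(weight * r[y] + w[y] for y in F[x]) for x in F}
    return w


def inequality_one_bound(theta):
    """Right-hand side of inequality (1): theta_1 + sum_{t>=2} |theta_t - theta_{t-1}|."""
    padded = list(theta) + [0.0]  # weights beyond the support are zero
    return theta[0] + sum(abs(b - a) for a, b in zip(padded, padded[1:]))
```

Comparing `v[x]` with `max(v[y] for y in F[x])` for each state `x` gives a quick numerical check of inequality (1) on small examples.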

In this paper, we are interested in the limit behavior when the decision-maker is very patient. Given an evaluation $\theta$, we define:

$$I(\theta) = \sum_{t\ge 1}|\theta_{t+1}-\theta_t|.$$

The decision-maker is considered as patient whenever $I(\theta)$ is small, so $I(\theta)$ may be seen as the impatience of $\theta$ (see Sorin, 2002 p. 105 and Renault 2012b). When $\theta = (\theta_t)_{t\ge 1}$ is non increasing, then $I(\theta)$ is just $\theta_1$. A classic example is when $\theta = \sum_{t=1}^n\frac{1}{n}\delta_t$: the value $v_\theta$ is just denoted $v_n$ and the evaluation corresponds to the average payoff from stage 1 to stage $n$. In this case $I(\theta) = 1/n \to_{n\to\infty} 0$.
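We also have $I(\theta)\le 2/n$ if $\theta = \sum_{t=m+1}^{m+n}\frac{1}{n}\delta_t$ for some non-negative $m$. Another example is the case of discounted payoffs, when $\theta = (\lambda(1-\lambda)^{t-1})_{t\ge 1}$ for some discount factor $\lambda\in(0,1]$, and in this case $I(\theta) = \lambda\to_{\lambda\to 0}0$.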
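The impatience is easy to compute for evaluations with finite support. The following Python sketch is illustrative: the helper names `impatience`, `cesaro` and `discounted` are assumptions, and the discounted evaluation is truncated at a finite horizon, which does not change the computation since the weights are non increasing.

```python
def impatience(theta):
    """I(theta) = sum_{t>=1} |theta_{t+1} - theta_t| for theta = (theta_1, ..., theta_T).

    Weights after stage T are taken to be zero.
    """
    padded = list(theta) + [0.0]
    return sum(abs(b - a) for a, b in zip(padded, padded[1:]))


def cesaro(n):
    """Uniform weights 1/n on stages 1..n."""
    return [1.0 / n] * n


def discounted(lam, horizon):
    """First `horizon` weights of (lam * (1 - lam)**(t - 1))_{t>=1}; the tail is dropped."""
    return [lam * (1 - lam) ** t for t in range(horizon)]
```

For a non increasing list the sum telescopes to the first weight, which matches the two classic examples: $1/n$ for the Cesaro mean and $\lambda$ for the discounted evaluation.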

Definition 3.4. The Gambling House $\Gamma = (X,F,r)$ has a general limit value $v^*$ if $(v_\theta)$ uniformly converges to $v^*$ when $I(\theta)$ goes to zero, i.e.:

$$\forall\varepsilon>0,\ \exists\alpha>0,\ \forall\theta,\ \Big(I(\theta)\le\alpha\implies\big(\forall x\in X,\ |v_\theta(x)-v^*(x)|\le\varepsilon\big)\Big).$$

The existence of the general limit value implies in particular that $(v_n)_n$ and $(v_\lambda)_\lambda$ converge to the same limit when $n$ goes to $+\infty$ and $\lambda$ goes to 0. This is coherent with the result of Lehrer and Sorin (1992), which states that the uniform convergence of $(v_n)_n$ and $(v_\lambda)_\lambda$ are equivalent.

In the definition of the general limit value, we require all value functions to be close to $v^*$ when the patience is high, but the plays used may depend on the precise expression of $\theta$. In the following definition, we require the same play to be simultaneously optimal for all $\theta$ patient enough.

Definition 3.5. The Gambling House $\Gamma = (X,F,r)$ has a general uniform value if it has a general limit value $v^*$ and moreover for each $\varepsilon>0$ one can find $\alpha>0$ and for each initial state $x$ a mixed play $\sigma(x)$ at $x$ satisfying:

$$\forall\theta,\ \Big(I(\theta)\le\alpha\implies\big(\forall x\in X,\ \gamma_\theta(\sigma(x))\ge v^*(x)-\varepsilon\big)\Big).$$

Up to now, the literature in repeated games has focused on the evaluations $\theta = \sum_{t=1}^n\frac{1}{n}\delta_t$ and $\theta = (\lambda(1-\lambda)^{t-1})_{t\ge 1}$. The standard (Cesaro)-uniform value can be defined by restricting the evaluations to be Cesaro means: for each $\varepsilon>0$ one can find $n_0$ and for each initial state $x$ a mixed play $\sigma(x)$ at $x$ satisfying: $\forall n\ge n_0$, $\forall x\in X$, $\gamma_n(\sigma(x))\ge v^*(x)-\varepsilon$. Recently, Renault (2011) considered deterministic Gambling Houses and characterized the uniform convergence of the value functions $(v_n)_n$. He also proved the existence of the standard Cesaro-uniform value under some assumptions, including the case where the set of states $X$ is metric precompact, the transitions are non expansive and the payoff function is uniformly continuous. As a corollary, he proved the existence of the uniform value in Partial Observation Markov Decision Processes with finite set of states (after each stage the decision-maker just observes a stochastic signal more or less correlated to the new state).

We now present our main theorem for Gambling Houses. Equation (1) implies that the general limit value $v^*$ necessarily has to satisfy some rigidity property. The function $v^*$ (or more precisely its linear extension to $Z$) can only be an "excessive function" in the terminology of potential theory (Choquet 1956) and gambling houses (Dubins and Savage 1965, Maitra and Sudderth 1996).

Definition 3.6. An affine function $w$ defined on $Z$ (or $\Delta(X)$) is said to be excessive if for all $x$ in $X$, $w(x)\ge\sup_{u\in F(x)}w(u)$.

Example 3.7. Let us consider the splitting transition given by $K$ a finite set, $X = \Delta(K)$ and $\forall x\in X$, $F(x) = \{u\in\Delta_f(X),\ \sum_{p\in X}u(p)p = x\}$. Then the function $w$ from $Z = \Delta_f(X)$ to $[0,1]$ is excessive if and only if the restriction of $w$ to $X$ is concave. Moreover given $u,u'\in\Delta_f(X)$, $u'\in F(u)$ if and only if $u'$ is the sweeping of $u$ as defined by Choquet (1956): for all continuous concave functions $f$ from $X$ to $[0,1]$, $u'(f)\le u(f)$.

Assume now that $X$ is a compact metric space and $r$ is continuous. $r$ is naturally extended to an affine continuous function on $\Delta(X)$ by $r(u) = \int_{p\in X}r(p)\,du(p)$ for all Borel probabilities $u$ on $X$. In the following definition, we consider the closure of the graph of $\hat F$ within the (compact) set $\Delta(X)\times\Delta(X)$.

Definition 3.8. An element $u$ in $\Delta(X)$ is said to be an invariant measure of the Gambling House $\Gamma = (X,F,r)$ if $(u,u)\in\mathrm{cl}(\mathrm{Graph}\,\hat F)$. The set of invariant measures of $\Gamma$ is denoted by $R$, so that:

$$R = \{u\in\Delta(X),\ (u,u)\in\mathrm{cl}(\mathrm{Graph}\,\hat F)\}.$$

$R$ is a convex compact subset of $\Delta(X)$. Recall that for $u$ and $u'$ in $\Delta(X)$, the Kantorovich-Rubinstein distance between $u$ and $u'$ is denoted by $d_{KR}(u,u') = \sup_{f\in E_1}|u(f)-u'(f)|$.

Theorem 3.9. Consider a Gambling House $\Gamma = (X,F,r)$ such that $X$ is a compact metric space, $r$ is continuous and $F$ is non expansive with respect to the KR distance:

$$\forall x\in X,\ \forall x'\in X,\ \forall u\in F(x),\ \exists u'\in F(x')\ \text{s.t.}\ d_{KR}(u,u')\le d(x,x').$$

Then the Gambling House has a general uniform value $v^*$ characterized by:

$$\forall x\in X,\quad v^*(x) = \inf\Big\{w(x),\ w:\Delta(X)\to[0,1]\ \text{affine } C^0\ \text{s.t.}\ (1)\ \forall y\in X,\ w(y)\ge\sup_{u\in F(y)}w(u)\ \text{and}\ (2)\ \forall u\in R,\ w(u)\ge r(u)\Big\}.$$

That is, $v^*$ is the smallest continuous affine function on $X$ which is 1) excessive and 2) above the running payoff $r$ on invariant measures.

Notice that:

1) when $\Gamma = (X,F,r)$ is deterministic, the hypotheses are satisfied as soon as $X$ is metric compact for some metric $d$, $r$ is continuous and $F$ is non expansive for $d$.

2) when $X$ is finite, one can use the distance $d(x,x') = 2$ for all $x\ne x'$ in $X$, so that for $u$ and $u'$ in $\Delta(X)$, $d_{KR}(u,u') = \|u-u'\|_1 = \sum_{x\in X}|u(x)-u'(x)|$, and the hypotheses are automatically satisfied. We will prove later a more general result for a model of MDP with finite state space, allowing for explicit actions influencing transitions and payoffs (see corollary 3.20).

Remark 3.10. The formula also holds when there is no decision maker, i.e. when $F$ is single-valued, and there are some similarities with the Von Neumann ergodic theorem (1932). Let $Z$ be a Hilbert space and $Q$ be a linear isometry on $Z$. This theorem states that for all $z\in Z$, the sequence $z_n = \frac{1}{n}\sum_{t=1}^n Q^t(z)$ converges to the projection $z^*$ of $z$ on the set $R$ of fixed points of $Q$. Using the linearity and the non expansiveness leads to a characterization by the set of fixed points. In particular, having in mind linear payoff functions of the form $(z\mapsto\langle l,z\rangle)$, we have that the projection $z^*$ of $z$ on $R$ is characterized by:

$$\forall l\in Z,\quad \langle l,z^*\rangle = \langle l^*,z\rangle = \inf\{\langle l',z\rangle,\ l'\in R\ \text{and}\ \langle l',r\rangle\ge\langle l,r\rangle\ \forall r\in R\}.$$

Example 3.11. We consider here a basic periodic sequence of 0 and 1. Let $X = \{0,1\}$ and for all $x\in X$, $F(x) = \{1-x\}$ and $r(x) = x$. There is a unique invariant measure $u = \frac{1}{2}\delta_0+\frac{1}{2}\delta_1$, and the general uniform value exists and satisfies $v^*(x) = \frac{1}{2}$ for all states $x$. Notice that considering evaluations $\theta = (\theta_t)_t$ such that $\theta_t$ is small for each $t$, without requiring $I(\theta)$ small, would not necessarily lead to $v^*$. Consider for instance $\theta^n = \sum_{t=1}^n\frac{1}{n}\delta_{2t}$ for each $n$: we have $v_{\theta^n}(x) = x$ for all $x$ in $X$.
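This example is simple enough to simulate. The Python sketch below is illustrative only (the function names are assumptions): it evaluates the unique play of the periodic Gambling House under an evaluation with finite support, and builds the evaluation $\theta^n$ that puts weight $1/n$ on the stages $2,4,\dots,2n$.

```python
def periodic_payoff(x0, theta):
    """theta-payoff of the unique play started at x0 in {0, 1}."""
    total, state = 0.0, x0
    for weight in theta:          # theta = (theta_1, ..., theta_T)
        state = 1 - state         # transition F(x) = {1 - x}
        total += weight * state   # payoff r(x) = x
    return total


def even_stage_evaluation(n):
    """Weight 1/n on each of the stages 2, 4, ..., 2n and 0 on odd stages."""
    return [0.0 if t % 2 else 1.0 / n for t in range(1, 2 * n + 1)]
```

Under `even_stage_evaluation(n)` the payoff equals the initial state for every `n`, whereas the Cesaro mean `[1 / (2 * n)] * (2 * n)` gives 1/2. Every weight of $\theta^n$ is at most $1/n$, but consecutive weights differ by $1/n$ at each of the $2n$ stages, so its impatience stays equal to 2.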






Example 3.12. The state space is the unit circle, let $X = \{x\in\mathbb{C},\ |x| = 1\}$ and $F(e^{i\alpha}) = e^{i(\alpha+1)}$ for all real $\alpha$. If we denote by $\mu$ the uniform distribution (Haar probability measure) on the circle, the mapping $F$ is $\mu$-ergodic and $\mu$ is $F$-invariant. By Birkhoff's theorem (1931), we know that the time average converges to the space average $\mu$-almost surely. Here $\mu$ is the unique invariant measure, and we obtain that the general uniform value is the constant:

$$\forall x\in X,\quad v^*(x) = \frac{1}{2\pi}\int_0^{2\pi}r(e^{i\alpha})\,d\alpha.$$

Notice that the value $v_\theta(x)$ converges to $v^*(x)$ for all $x$ in $X$, and not only for $\mu$-almost all $x$ in $X$.

Example 3.13. Let $\Gamma = (X,F,r)$ be a MDP satisfying the hypotheses of theorem 3.9, such that for all $x\in X$, $\delta_x\in F(x)$. Therefore the set $R$ is equal to $\Delta(X)$. In the terminology of Gambling Theory (see Maitra and Sudderth, 1996), $\Gamma$ is called a leavable gambling house since at each stage the player can stay at the current state. The limit value $v^*$ is here characterized by:

$$v^* = \inf\{v : X\to[0,1]\ C^0,\ v \text{ is excessive and } v\ge r\}.$$

In the above formula, $v$ excessive means: $\forall x\in X$, $v(x)\ge\sup_{u\in F(x)}\mathbb{E}_u(v)$. This is a variant of the fundamental theorem of gambling theory (see section 3.1 in Maitra and Sudderth 1996).

Example 3.14. The following deterministic Gambling House, which is an extension of example 1.4.4. in Sorin (2002) and of example 5.2 of Renault (2011), shows that the assumptions of theorem 3.9 allow for many speeds of convergence to the limit value $v^*$. Here $l>1$ is a fixed parameter, $X$ is the simplex $\{x = (p^a,p^b,p^c)\in\mathbb{R}^3_+,\ p^a+p^b+p^c = 1\}$ and the initial state is $x_0 = (1,0,0)$. The payoff is $r(p^a,p^b,p^c) = p^b-p^c$, and the transition is defined by: $F(p^a,p^b,p^c) = \{((1-\alpha-\alpha^l)p^a,\ p^b+\alpha p^a,\ p^c+\alpha^l p^a),\ \alpha\in[0,1/2]\}$.

The probabilistic interpretation is the following: there are 3 points $a$, $b$ and $c$, and the initial point is $a$. The payoff is 0 at $a$, it is $+1$ at $b$, and $-1$ at $c$. At point $a$, the decision maker has to choose $\alpha\in[0,1/2]$: then $b$ is reached with probability $\alpha$, $c$ is reached with probability $\alpha^l$, and the play stays in $a$ with the remaining probability $1-\alpha-\alpha^l$. When $b$ (resp. $c$) is reached, the play stays at $b$ (resp. $c$) forever. So the decision maker starting at point $a$ wants to reach $b$ and to avoid $c$. By playing at each stage $\alpha>0$ small enough, he can get as close to $b$ as he wants.

Back to our deterministic setup, we use the norm $\|\cdot\|_1$ and obtain that $X$ is compact, $F$ is non expansive and $r$ is continuous, so that theorem 3.9 applies. The limit value is given by $v^*(p^a,p^b,p^c) = p^a+p^b$, and if we denote by $x_\lambda$ the value $v_\lambda(x_0)$, we have for all $\lambda\in(0,1]$: $x_\lambda = \Phi(x_\lambda)$, where for all $x\in\mathbb{R}$,

$$\Phi(x) = \max_{\alpha\in[0,1/2]}\ (1-\lambda)(1-\alpha-\alpha^l)x+\alpha.$$

Since $x_\lambda\in(0,1)$, the first order condition gives $(1-\lambda)x_\lambda(-1-l\alpha^{l-1})+1 = 0$ and we can obtain:

$$x_\lambda = \frac{1}{(1-\lambda)\Big(1+l\Big(\frac{\lambda}{(1-\lambda)(l-1)}\Big)^{\frac{l-1}{l}}\Big)}.$$

Finally we can compute an equivalent of $x_\lambda$ as $\lambda$ goes to 0. We have

$$\Big(\frac{\lambda}{(1-\lambda)(l-1)}\Big)^{\frac{l-1}{l}} = \Big(\frac{1}{l-1}\Big)^{\frac{l-1}{l}}\lambda^{\frac{l-1}{l}}\big(1+o(\lambda^{\frac{l-1}{l}})\big),$$

so that

$$v_\lambda(x_0) = \frac{1}{(1-\lambda)\Big(l\Big(\big(\frac{1}{l-1}\big)^{\frac{l-1}{l}}\lambda^{\frac{l-1}{l}}+o(\lambda^{\frac{l-1}{l}})\Big)+1\Big)},$$

$$v_\lambda(x_0) = 1-C\lambda^{\frac{l-1}{l}}+o(\lambda^{\frac{l-1}{l}})\quad\text{with}\quad C = \frac{l}{(l-1)^{\frac{l-1}{l}}}.$$
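The fixed-point equation of this example can be checked numerically. The Python sketch below is an illustration under stated assumptions: it takes the map $\Phi(x) = \max_{\alpha\in[0,1/2]}(1-\lambda)(1-\alpha-\alpha^l)x+\alpha$ as given, iterates it (it is a $(1-\lambda)$-contraction in $x$), and compares the result with the closed form obtained by combining the first order condition with $x = \Phi(x)$, namely $\alpha = (\lambda/((1-\lambda)(l-1)))^{1/l}$. The function names and the ternary search are choices made for this sketch, and the closed form assumes that this $\alpha$ lies in $[0,1/2]$.

```python
def phi(x, lam, l, steps=100):
    """Phi(x) = max over alpha in [0, 1/2] of (1-lam)(1-alpha-alpha**l) x + alpha."""
    def objective(alpha):
        return (1 - lam) * (1 - alpha - alpha ** l) * x + alpha

    lo, hi = 0.0, 0.5
    for _ in range(steps):  # ternary search; the objective is concave when x >= 0
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2)


def fixed_point(lam, l, iterations=2000):
    """Iterate x <- Phi(x) from x = 0; converges at rate (1 - lam)."""
    x = 0.0
    for _ in range(iterations):
        x = phi(x, lam, l)
    return x


def closed_form(lam, l):
    """Value obtained from the first order condition, valid when alpha <= 1/2."""
    alpha = (lam / ((1 - lam) * (l - 1))) ** (1 / l)
    return 1 / ((1 - lam) * (1 + l * alpha ** (l - 1)))
```

For small $\lambda$ the closed form is close to $1-C\lambda^{(l-1)/l}$ with $C = l/(l-1)^{(l-1)/l}$, which illustrates how the speed of convergence depends on $l$. The default number of iterations is only adequate when $\lambda$ is not too small.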



3.2 Long-term values for standard MDPs 

A standard Markov Decision Problem $\Psi$ is given by a non empty set of states $X$, a non empty set of actions $A$, a mapping $q : X\times A\to\Delta_f(X)$ and a payoff function $g : X\times A\to[0,1]$. At each stage, the player learns the current state $x$ and chooses an action $a$. He then receives the payoff $g(x,a)$, a new state is drawn accordingly to $q(x,a)$ and the game proceeds to the next stage.

Definition 3.15. A pure, or deterministic, strategy is a sequence of mappings $\sigma = (\sigma_t)_{t\ge 1}$ where $\sigma_t : (X\times A)^{t-1}\times X\to A$ for each $t$. A strategy (or behavioral strategy) is a sequence of mappings $\sigma = (\sigma_t)_{t\ge 1}$ where $\sigma_t : (X\times A)^{t-1}\times X\to\Delta_f(A)$ for each $t$. We denote by $\Sigma$ the set of strategies.

A pure strategy is a particular case of strategy. An initial state $x_1$ in $X$ and a strategy $\sigma$ naturally induce a probability distribution with finite support over the set of finite histories $(X\times A)^n$ for all $n$, which can be uniquely extended to a probability over the set $(X\times A)^\infty$ of infinite histories.

Definition 3.16. Given an evaluation $\theta$ and an initial state $x_1$ in $X$, the $\theta$-payoff of a strategy $\sigma$ at $x_1$ is defined as $\gamma_\theta(x_1,\sigma) = \mathbb{E}_{x_1,\sigma}\big(\sum_{t\ge 1}\theta_t g(x_t,a_t)\big)$, and the $\theta$-value at $x_1$ is:

$$v_\theta(x_1) = \sup_{\sigma\in\Sigma}\gamma_\theta(x_1,\sigma).$$

As for gambling houses, it is easy to see that the supremum can be taken over the smaller set of pure strategies, and one can derive a recursive formula linking the value functions. General limit and uniform values are defined as in the previous subsection 3.1.

Definition 3.17. Let $\Psi = (X,A,q,g)$ be a standard MDP.

$\Psi$ has a general limit value $v^*$ if $(v_\theta)$ uniformly converges to $v^*$ when $I(\theta)$ goes to zero, i.e. for each $\varepsilon>0$ one can find $\alpha>0$ such that:

$$\forall\theta,\ \Big(I(\theta)\le\alpha\implies\big(\forall x\in X,\ |v_\theta(x)-v^*(x)|\le\varepsilon\big)\Big).$$

$\Psi$ has a general uniform value if it has a general limit value $v^*$ and moreover for each $\varepsilon>0$ one can find $\alpha>0$ and a behavior strategy $\sigma(x)$ for each initial state $x$ satisfying:

$$\forall\theta,\ \Big(I(\theta)\le\alpha\implies\big(\forall x\in X,\ \gamma_\theta(x,\sigma(x))\ge v^*(x)-\varepsilon\big)\Big).$$

We now present a notion of invariance for the MDP $\Psi$. The next definition will be similar to definition 3.8, however one needs to be slightly more sophisticated here to incorporate the payoff component. Assume now that $X$ is a compact metric space, and define for each $(u,y)$ in $\Delta_f(X)\times[0,1]$,

$$\hat F(u,y) = \Big\{\Big(\sum_{x\in X}u(x)q(x,a(x)),\ \sum_{x\in X}u(x)g(x,a(x))\Big),\ \text{where } a : X\to\Delta_f(A)\Big\},$$

where $q(x,\cdot)$ and $g(x,\cdot)$ have been linearly extended for all $x$. We have defined a correspondence $\hat F$ from $\Delta_f(X)\times[0,1]$ to itself. It is easy to see that $\hat F$ always is an affine correspondence (see lemma 3.26 later). In the following definition we consider the closure of the graph of $\hat F$ within the compact set $(\Delta(X)\times[0,1])^2$, with the weak topology.

Definition 3.18. An element $(u,y)$ in $\Delta(X)\times[0,1]$ is said to be an invariant couple for the MDP $\Psi$ if $((u,y),(u,y))\in\mathrm{cl}(\mathrm{Graph}(\hat F))$. The set of invariant couples of $\Psi$ is denoted by $RR$.

Our main result for standard MDPs is the following, where $X$ is assumed to be a compact subset of a simplex $\Delta(K)$. Recall that $D_1 = \{f\in C(\Delta(K)),\ \forall x,y\in\Delta(K),\ \forall a,b\ge 0,\ af(x)-bf(y)\le\|ax-by\|_1\}$, and any $f$ in $D_1$ is linearly extended to $\Delta(\Delta(K))$.

Theorem 3.19. Let $\Psi = (X,A,q,g)$ be a standard MDP where $X$ is a compact subset of a simplex $\Delta(K)$, such that:

$$\forall x\in X,\ \forall y\in X,\ \forall a\in A,\ \forall f\in D_1,\ \forall\alpha\ge 0,\ \forall\beta\ge 0,$$
$$|\alpha f(q(x,a))-\beta f(q(y,a))|\le\|\alpha x-\beta y\|_1\quad\text{and}\quad|\alpha g(x,a)-\beta g(y,a)|\le\|\alpha x-\beta y\|_1.$$

Then $\Psi$ has a general uniform value $v^*$ characterized by: for all $x$ in $X$,

$$v^*(x) = \inf\Big\{w(x),\ w:\Delta(X)\to[0,1]\ \text{affine } C^0\ \text{s.t.}\ (1)\ \forall x'\in X,\ w(x')\ge\sup_{a\in A}w(q(x',a))\ \text{and}\ (2)\ \forall(u,y)\in RR,\ w(u)\ge y\Big\}.$$

The proof of theorem 3.19 will be in section 3.4. An immediate corollary is when the state space is finite.



Corollary 3.20. Consider a standard MDP $(K,A,q,g)$ with a finite set of states $K$. Then it has a general uniform value $v^*$, and for each state $k$:

$$v^*(k) = \inf\Big\{w(k),\ w:\Delta(K)\to[0,1]\ \text{affine s.t.}\ (1)\ \forall k'\in K,\ w(k')\ge\sup_{a\in A}w(q(k',a))\ \text{and}\ (2)\ \forall(p,y)\in RR,\ w(p)\ge y\Big\},$$

with $RR = \{(p,y)\in\Delta(K)\times[0,1],\ ((p,y),(p,y))\in\mathrm{cl}(\mathrm{conv}(\mathrm{Graph}(F)))\}$ and $F(k,y) = \{(q(k,a),g(k,a)),\ a\in A\}$.

Proof: $K$ is viewed as a subset of the simplex $\Delta(K)$, endowed with the $L^1$-norm. Fix $k,k'$ in $K$, $a$ in $A$, $\alpha\ge 0$ and $\beta\ge 0$. We have

$$\|\alpha k-\beta k'\| = \begin{cases}|\alpha-\beta| & \text{if } k = k',\\ \alpha+\beta & \text{otherwise.}\end{cases}$$

First,

$$|\alpha g(k,a)-\beta g(k',a)|\ \le\ \begin{cases}|\alpha-\beta|\,g(k,a) & \text{if } k = k',\\ \alpha+\beta & \text{otherwise,}\end{cases}$$

so in all cases $|\alpha g(k,a)-\beta g(k',a)|\le\|\alpha k-\beta k'\|$. Secondly, consider $f\in D_1$. $f$ takes values in $[-1,1]$, so similarly we have: $|\alpha f(q(k,a))-\beta f(q(k',a))|\le\|\alpha k-\beta k'\|$. So we can apply theorem 3.19, and the graph of $\hat F$ is the convex hull of the graph of $F$. □
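The payoff condition checked in this proof can be tested on a finite grid. The following Python sketch is illustrative only: the function names and the dictionary encoding of $g$ (keys `(k, a)`) are assumptions made for this example.

```python
import itertools


def l1_distance_scaled_diracs(alpha, k, beta, k_prime):
    """|| alpha * delta_k - beta * delta_k' ||_1 for two states k, k' of K."""
    return abs(alpha - beta) if k == k_prime else alpha + beta


def payoff_condition_holds(g, states, actions, weights, tol=1e-12):
    """Check |alpha g(k,a) - beta g(k',a)| <= ||alpha k - beta k'||_1 on a grid.

    g is a dict keyed by (state, action); weights is a list of
    non-negative values used for both alpha and beta.
    """
    for k, k_prime, a in itertools.product(states, states, actions):
        for alpha, beta in itertools.product(weights, repeat=2):
            lhs = abs(alpha * g[k, a] - beta * g[k_prime, a])
            rhs = l1_distance_scaled_diracs(alpha, k, beta, k_prime)
            if lhs > rhs + tol:
                return False
    return True
```

With payoffs in $[0,1]$ the check succeeds for any grid of weights, while a payoff larger than 1 makes it fail, which shows where the bound on $g$ is used.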

Remark 3.21. When the set of actions is finite, we are in the setting of Blackwell (1962) and the value is characterized by the Average Cost Optimality Equation. In fact in this setting, our characterization leads to a dual formulation of a result of Denardo and Fox (1968). Denardo and Fox (1968) showed that the value $v^*$ is the smallest (pointwise) excessive function for which there exists a vector $h\in\mathbb{R}^K$ such that $(v^*,h)$ is superharmonic in the sense of Hordijk and Kallenberg (1979), i.e.

$$\forall k\in K,\ \forall a\in A,\quad v^*(k)+h(k)\ \ge\ g(k,a)+\sum_{k'}q(k,a)(k')h(k'). \qquad (2)$$

Given a function $w$, the existence of a vector $h$ such that $(w,h)$ is superharmonic is a linear programming problem with $K\times A$ inequalities. By Farkas' lemma it has a solution if and only if a dual problem has no solution, and the dual programming problem is to find a solution $\pi\in\mathbb{R}^{K\times A}$ of the following system:

$$\forall k\in K,\quad \sum_{a'\in A}\pi(k,a') = \sum_{k'\in K,\,a'\in A}\pi(k',a')\,q(k',a')(k)$$

$$\forall(k,a)\in K\times A,\quad \pi(k,a)\ge 0$$

$$\forall k\in K,\quad \sum_{a'\in A}\pi(k,a')\,g(k,a') > w(k).$$






If we denote by $p$ the marginal of $\pi$ on $K$ and define for all $k$ such that $p(k)>0$, $\sigma(k) = \pi(k,\cdot)/p(k)$, and set $\sigma(k)$ to any probability otherwise, then $\sigma$ is a strategy in the MDP. Moreover $p$ is invariant under $\sigma$ and the stage payoff $y$ is greater than $w(p)$, thus the couple $(p,y)$ is in $RR$ and the condition (2) in corollary 3.20 is not satisfied. Reciprocally, since the action set is compact, given $(p,y)\in RR$, there exists a strategy $\sigma$ such that $p$ is invariant under $\sigma$ and the payoff is $y$. Therefore if the condition (2) is true then there exists $h\in\mathbb{R}^K$ such that $(w,h)$ is superharmonic. Note that Denardo and Fox state a dual of the minimization problem and obtain an explicit dual maximization problem whose solution is the value. Hordijk and Kallenberg exhibit from the solutions of this dual problem an optimal strategy.



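3.3 Proof of theorem 3.9

In this section we consider a compact metric space $(X,d)$, and we use the Kantorovich-Rubinstein distance $d = d_{KR}$ on $\Delta(X)$. We write $Z = \Delta_f(X)$, $\bar Z = \Delta(X)$. We start with a lemma.

Lemma 3.22. Let $F : X\rightrightarrows\Delta_f(X)$ be non expansive for $d_{KR}$. Then the mixed extension of $F$ is 1-Lipschitz from $\Delta_f(X)$ to $\Delta_f(X)$ for $d_{KR}$.

Proof of lemma 3.22: We first show that the mapping $(p\mapsto\mathrm{conv}F(p))$ is non expansive from $X$ to $Z$. Indeed, consider $p$ and $p'$ in $X$, and $u = \sum_{i\in I}\alpha_i u_i$, with $I$ finite, $\alpha_i\ge 0$, $u_i\in F(p)$ for each $i$, and $\sum_{i\in I}\alpha_i = 1$. By assumption for each $i$ one can find $u'_i$ in $F(p')$ such that $d_{KR}(u_i,u'_i)\le d(p,p')$. Define $u' = \sum_{i\in I}\alpha_i u'_i\in\mathrm{conv}F(p')$. We have:

$$d_{KR}(u,u') = d_{KR}\Big(\sum_{i\in I}\alpha_i u_i,\ \sum_{i\in I}\alpha_i u'_i\Big)\ \le\ \sum_{i\in I}\alpha_i\,d_{KR}(u_i,u'_i)\ \le\ d(p,p').$$

We now prove that $\hat F$ is 1-Lipschitz from $Z$ to $Z$. Let $u_1$, $u_2$ be in $Z$ and $v_1 = \sum_{p\in X}u_1(p)f_1(p)$, where $f_1(p)\in\mathrm{conv}F(p)$ for each $p$. By the Kantorovich duality formula, there exists a coupling $\gamma = (\gamma(p,q))_{(p,q)\in X\times X}$ in $\Delta_f(X\times X)$ with first marginal $u_1$ and second marginal $u_2$ satisfying:

$$d_{KR}(u_1,u_2) = \sum_{(p,q)\in X\times X}\gamma(p,q)\,d(p,q).$$

For each $p$, $q$ in $X$ by the first part of this proof there exists $f^p(q)\in\mathrm{conv}F(q)$ such that $d_{KR}(f^p(q),f_1(p))\le d(p,q)$. We define:

$$f_2(q) = \sum_{p}\frac{\gamma(p,q)}{u_2(q)}f^p(q)\in\mathrm{conv}F(q),\quad\text{and}\quad v_2 = \sum_{q}u_2(q)f_2(q)\in\hat F(u_2).$$

We now conclude.

$$d_{KR}(v_1,v_2) = d_{KR}\Big(\sum_p u_1(p)f_1(p),\ \sum_q u_2(q)f_2(q)\Big) = d_{KR}\Big(\sum_{p,q}\gamma(p,q)f_1(p),\ \sum_{q,p}\gamma(p,q)f^p(q)\Big)$$
$$\le\ \sum_{p,q}\gamma(p,q)\,d_{KR}(f_1(p),f^p(q))\ \le\ \sum_{p,q}\gamma(p,q)\,d(p,q) = d_{KR}(u_1,u_2).$$

The mixed extension of $F$ is 1-Lipschitz. □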



We now consider a Gambling House $\Gamma = (X,F,r)$ and assume the hypotheses of theorem 3.9 are satisfied. We will work (see footnote 2) with the deterministic Gambling House $\hat\Gamma = (\Delta_f(X),\hat F,r)$. Recall that $r$ is extended to an affine and continuous mapping on $\Delta(X)$ whereas $\hat F$ is an affine non expansive correspondence from $Z$ to $Z$.

For $p$ in $X$, the pure plays in $\hat\Gamma$ at the initial state $\delta_p$ coincide with the mixed plays in $\Gamma$ at the initial state $p$. As a consequence, the $\theta$-value for $\Gamma$ at $p$ coincides with the $\theta$-value for $\hat\Gamma$ at $\delta_p$, which is written $v_\theta(p) = v_\theta(\delta_p)$. Because $\hat F$ and $r$ are affine on $Z$, the $\theta$-value for $\hat\Gamma$, as a function defined on $Z$, is the affine extension of the original $v_\theta$ defined on $X$. So we have a unique value function $v_\theta$ which is defined on $Z$ and is affine. Because $\hat F$ is 1-Lipschitz and $r$ is uniformly continuous, all the value functions $v_\theta$ have the same modulus of continuity as $r$, so $(v_\theta)_\theta$ is an equicontinuous family of mappings from $Z$ to $[0,1]$. Consequently, we extend $v_\theta$ to an affine mapping on $\bar Z$ with the same modulus of continuity, and the family $(v_\theta)_\theta$ now is an equicontinuous (see footnote 3) family of mappings from $\bar Z$ to $[0,1]$.

We define $R$ and $v^*$ as in the statement of theorem 3.9, so that for all $x$ in $X$,

$$v^*(x) = \inf\Big\{w(x),\ w:\bar Z\to[0,1]\ \text{affine } C^0\ \text{s.t.}\ (1)\ \forall y\in X,\ w(y)\ge\sup_{u\in F(y)}w(u)\ \text{and}\ (2)\ \forall u\in R,\ w(u)\ge r(u)\Big\}.$$

2. A variant of the proof would be to consider the Gambling House on $\Delta(X)$ where the transition correspondence is defined so that its graph is the closure of the graph of $\hat F$. Part 1) of lemma 3.23 shows this correspondence is also non expansive.

3. $Z$ being precompact, this is enough to obtain the existence of a general limit value, see Renault 2012b. Here we will moreover obtain a characterization of this value and the existence of the general uniform value.

We start with a technical lemma using the non-expansiveness of $\hat F$.

Lemma 3.23. 1) Given $(u,u')$ in $\mathrm{cl}(\mathrm{Graph}(\hat F))$, $v$ in $Z$ and $\varepsilon>0$, there exists $v'\in\hat F(v)$ such that $d(u',v')\le d(u,v)+\varepsilon$.

2) Given a sequence $(z_t)_{t\ge 0}$ of elements of $\bar Z$ such that $(z_t,z_{t+1})\in\mathrm{cl}(\mathrm{Graph}(\hat F))$ for all $t\ge 0$, for each $\varepsilon$ one can find a sequence $(z'_t)_{t\ge 0}$ of elements of $Z$ such that $(z'_t)_{t\ge 1}$ is a play at $z'_0$, and $d(z_t,z'_t)\le\varepsilon$ for each $t\ge 0$.

Proof of lemma 3.23: 1) For all $\varepsilon>0$ there exists $(z,z')\in\mathrm{Graph}(\hat F)$ such that $d(z,u)\le\varepsilon$ and $d(z',u')\le\varepsilon$. Because $\hat F$ is non expansive, one can find $v'$ in $\hat F(v)$ such that $d(z',v')\le d(z,v)$. Consequently, $d(v',u')\le d(v',z')+d(z',u')\le d(z,v)+\varepsilon\le d(u,v)+2\varepsilon$.

2) It is first easy to construct $(z'_0,z'_1)$ in the graph of $\hat F$ such that $d(z'_0,z_0)\le\varepsilon$ and $d(z'_1,z_1)\le\varepsilon$. $(z_1,z_2)\in\mathrm{cl}(\mathrm{Graph}(\hat F))$ so by 1) one can find $z'_2$ in $\hat F(z'_1)$ such that $d(z_2,z'_2)\le d(z_1,z'_1)+\varepsilon^2\le\varepsilon+\varepsilon^2$. Iterating, we construct a play $(z'_t)_{t\ge 1}$ at $z'_0$ such that $d(z_t,z'_t)\le\varepsilon+\varepsilon^2+\dots+\varepsilon^t$ for each $t$.

Proposition 3.24. $\Gamma$ has a general limit value given by $v^*$.

Proof of proposition 3.24: By Ascoli's theorem, it is enough to show that any limit point of $(v_\theta)_\theta$ (for the uniform convergence) coincides with $v^*$. We thus assume that $(v_{\theta^k})_k$ uniformly converges to $v$ on $\bar Z$ when $k$ goes to $\infty$, for a family of evaluations satisfying:

$$\sum_{t\ge 1}|\theta^k_{t+1}-\theta^k_t|\ \to_{k\to\infty}\ 0.$$

And we need to show that $v = v^*$.

A) We first show that $v\ge v^*$.

It is plain that $v$ can be extended to an affine function on $\bar Z$ and has the same modulus of continuity as $r$. Because $\sum_{t\ge 1}|\theta^k_{t+1}-\theta^k_t|\to_{k\to\infty}0$, we have by equation (1) of section 3.1 that: $\forall y\in X$, $v(y) = \sup_{u\in F(y)}v(u)$.

Let now $u$ be in $R$. By lemma 3.23 for each $\varepsilon$ one can find $u_0$ in $Z$ and a play $(u_1,u_2,\dots,u_t,\dots)$ such that $u_t\in\hat F(u_{t-1})$ and $d(u,u_t)\le\varepsilon$ for all $t\ge 0$. Because $r$ is uniformly continuous, we get $v(u)\ge r(u)$.

By definition of $v^*$ as an infimum, we obtain: $v^*\le v$.

B) We show that $v^*\ge v$. Let $w$ be a continuous affine mapping from $\bar Z$ to $[0,1]$ satisfying (1) and (2) of the definition of $v^*$. It is enough to show that $w(p)\ge v(p)$ for each $p$ in $X$. Fix $p$ in $X$ and $\varepsilon>0$.

For each $k$, let $\sigma^k = (u^k_1,\dots,u^k_t,\dots)\in Z^\infty$ be a play at $\delta_p$ for $\hat\Gamma$ which is almost optimal for the $\theta^k$-value, in the sense that $\sum_{t\ge 1}\theta^k_t r(u^k_t)\ge v_{\theta^k}(p)-\varepsilon$. Define:

$$u(k) = \sum_{t=1}^{\infty}\theta^k_t u^k_t\in\bar Z\quad\text{and}\quad u'(k) = \sum_{t=1}^{\infty}\theta^k_t u^k_{t+1}\in\bar Z.$$

$u(k)$ and $u'(k)$ are well-defined limits of normal convergent series in the Banach space $C(X)'$. Because $\hat F$ is affine, its graph is a convex set and $(u(k),u'(k))\in\mathrm{cl}(\mathrm{Graph}(\hat F))$ for each $k$.

Moreover, we have $d(u(k),u'(k))\le\mathrm{diam}(X)\big(\theta^k_1+\sum_{t\ge 2}|\theta^k_t-\theta^k_{t-1}|\big)$, where $\mathrm{diam}(X)$ is the diameter of $X$. Consequently, $\sum_{t\ge 1}|\theta^k_{t+1}-\theta^k_t|\to_{k\to\infty}0$ implies $d(u(k),u'(k))\to_{k\to\infty}0$. Considering a limit point of the sequence $(u(k),u'(k))_k$, we obtain some $u$ in $R$. By assumption on $w$, $w(u)\ge r(u)$. Moreover, for each $k$ we have $r(u(k)) = \sum_{t\ge 1}\theta^k_t r(u^k_t)\ge v_{\theta^k}(p)-\varepsilon$, so $r(u)\ge v(p)-\varepsilon$.

Because $w$ is excessive, we obtain that for each $k$ the sequence $(w(u^k_t))_t$ is non increasing, so $w(u(k)) = \sum_{t\ge 1}\theta^k_t w(u^k_t)\le w(p)$. So we obtain:

$$w(p)\ \ge\ w(u)\ \ge\ r(u)\ \ge\ v(p)-\varepsilon.$$

This is true for all $\varepsilon$, so $w\ge v$. □

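Proposition 3.25. $\Gamma$ has a general uniform value.

Proof of proposition 3.25: First we can extend the notion of mixed play to $Z$. A mixed play at $u_0\in Z$ is a sequence $\sigma = (u_1,\dots,u_t,\dots)\in Z^\infty$ such that $u_{t+1}\in\hat F(u_t)$ for each $t\ge 0$, and we denote by $\Sigma(u_0)$ the set of mixed plays at $u_0$. Given $t$, $T$ in $\mathbb{N}$, $n\in\mathbb{N}^*$ and $u_0\in Z$, we define for each mixed play $\sigma = (u_t)_{t\ge 1}\in\Sigma(u_0)$ the auxiliary payoff:

$$\gamma_{t,n}(\sigma) = \frac{1}{n}\sum_{l=t+1}^{t+n}r(u_l),\quad\text{and}\quad\beta_{T,n}(\sigma) = \inf_{t\in\{0,\dots,T\}}\gamma_{t,n}(\sigma).$$

And we also define the auxiliary value function: for all $u_0$ in $Z$,

$$h_{T,n}(u_0) = \sup_{\sigma\in\Sigma(u_0)}\beta_{T,n}(\sigma).$$

Clearly, $\beta_{T,n}(\sigma)\le\gamma_{0,n}(\sigma)$ and $h_{T,n}(u_0)\le v_n(u_0)$. We can write:

$$h_{T,n}(u_0) = \sup_{\sigma\in\Sigma(u_0)}\ \inf_{\theta\in\Delta(\{0,\dots,T\})}\ \sum_{t=0}^{T}\theta_t\,\frac{1}{n}\sum_{l=t+1}^{t+n}r(u_l) = \sup_{\sigma\in\Sigma(u_0)}\ \inf_{\theta\in\Delta(\{0,\dots,T\})}\ \sum_{l=1}^{T+n}\beta_l(\theta,n)\,r(u_l),$$

where for each $l$ in $1,\dots,T+n$,

$$\beta_l(\theta,n) = \frac{1}{n}\sum_{t=\max\{0,\,l-n\}}^{\min\{T,\,l-1\}}\theta_t.$$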

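By construction, $\hat F$ is affine, so $\Sigma(u_0)$ is a convex subset of $Z^\infty$. $\Delta(\{0,\dots,T\})$ is convex compact and the payoff $\sum_{l=1}^{T+n}\beta_l(\theta,n)r(u_l)$ is affine both in $\theta$ and in $\sigma$. We can apply a standard minmax theorem to get:

$$h_{T,n}(u_0) = \inf_{\theta\in\Delta(\{0,\dots,T\})}\ \sup_{\sigma\in\Sigma(u_0)}\ \sum_{l=1}^{T+n}\beta_l(\theta,n)\,r(u_l).$$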

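We write $\theta_t = 0$ for $t>T$ and for each $l\ge 0$: $\beta_l(\theta,n) = \frac{1}{n}(\theta_0+\dots+\theta_{l-1})$ if $l\le n$, $\beta_l(\theta,n) = \frac{1}{n}(\theta_{l-n}+\dots+\theta_{l-1})$ if $n+1\le l\le n+T$, $\beta_l(\theta,n) = 0$ if $l>n+T$. The evaluation $\beta(\theta,n)$ is a particular probability on stages and $h_{T,n}(u_0) = \inf_{\theta\in\Delta(\{0,\dots,T\})}v_{\beta(\theta,n)}(u_0)$. It is easy to bound the impatience of $\beta(\theta,n)$:

$$\sum_{l\ge 0}|\beta_{l+1}(\theta,n)-\beta_l(\theta,n)| = \sum_{l=0}^{n-1}\frac{\theta_l}{n}+\sum_{l\ge n}\frac{|\theta_l-\theta_{l-n}|}{n}\ \le\ \frac{2}{n}\ \to_{n\to\infty}\ 0.$$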
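The weights $\beta_l(\theta,n)$ are sliding-window sums of $\theta$, which the following Python sketch makes explicit. It is an illustration under assumptions: the function names are hypothetical, $\theta$ is given as the list $(\theta_0,\dots,\theta_T)$, and the variation is computed with the convention $\beta_0 = 0$.

```python
def beta_weights(theta, n):
    """Stage weights beta_l(theta, n) for l = 1..T+n, with theta on {0, ..., T}.

    beta_l is 1/n times the sum of theta_t over max(0, l-n) <= t <= min(T, l-1).
    """
    T = len(theta) - 1
    return [sum(theta[max(0, l - n):min(T, l - 1) + 1]) / n
            for l in range(1, T + n + 1)]


def variation_from_zero(weights):
    """Sum of |beta_{l+1} - beta_l| over l >= 0, with beta_0 = 0 and zero tail."""
    padded = [0.0] + list(weights) + [0.0]
    return sum(abs(b - a) for a, b in zip(padded, padded[1:]))
```

Each $\theta_t$ is counted in exactly $n$ consecutive windows, so the weights sum to 1, and the variation never exceeds $2/n$ whatever $\theta$ is. This uniformity in $\theta$ is what the rest of the proof relies on.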



The impatience of f}{9, n) goes to zero as n goes to infinity, uniformly in 9. So we 
can use the previous proposition 13.241 to get : 

Ve > 0,3no,Vn > no,V^ G A(IV),Vno G Z, |t;/3(e,n)M - < e. 

This implies that /ioo,n(Mo) :=de/ infeeA{iV) ?^/3{e,n)(Mo) = infT>o hT,n{,Uo) converges 
to v*{uo) when n — )■ oo, and the convergence is uniform over Z. Consequently, if 
we fix e > there exists uq such that for all Uq in Z, for all T > 0, there exists a 
play CT"^ = {uj)t>i in S(mo) such that the average payoff is good on every interval 
of no stages starting before T + 1 : for alH = 0, T, ■jt,no{o''^) > v*{uo) — e. 

We fix $u_0$ in $Z$ and consider, for each $T$, the play $\sigma^T = (u_t^T)_{t \ge 1}$ in $\Sigma(u_0)$ as above. By a diagonal argument we can construct for each $t \ge 1$ a limit point $z_t$ in $\bar{Z}$ of the sequence $(u_t^T)_{T \ge 0}$ such that for each $t$ we have $(z_t, z_{t+1}) \in \mathrm{cl}(\mathrm{Graph}(F))$, with $z_0 = u_0$. For each $m \ge 0$, we have $\frac{1}{n_0} \sum_{t=m+1}^{m+n_0} r(u_t^T) \ge v^*(u_0) - \varepsilon$ for $T$ large enough, so at the limit we get: $\frac{1}{n_0} \sum_{t=m+1}^{m+n_0} r(z_t) \ge v^*(u_0) - \varepsilon$.

$r$ being uniformly continuous, there exists $\alpha$ such that $|r(z) - r(z')| \le \varepsilon$ as soon as $d(z,z') \le \alpha$. By lemma 3.23 one can find $\sigma' = (z'_1, \dots, z'_t, \dots)$ in $\Sigma(z_0)$ such that for each $t$, $d(z_t, z'_t) \le \alpha$. We obtain that for each $m \ge 0$, $\frac{1}{n_0} \sum_{t=m+1}^{m+n_0} r(z'_t) \ge v^*(u_0) - 2\varepsilon$.

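Consequently we have proved: for all $\varepsilon > 0$, there exists $n_0$ such that for every initial state $p$ in $X$, there exists a mixed play $\sigma' = (z'_t)_t$ at $p$ such that: $\forall m \ge 0$, $\frac{1}{n_0} \sum_{t=m+1}^{m+n_0} r(z'_t) \ge v^*(p) - 2\varepsilon$. Let $\theta \in \Delta(\mathbb{N}^*)$ be an evaluation; it is now easy to conclude. First, if $v^*(p) - 2\varepsilon < 0$, then any play is $2\varepsilon$-optimal. Otherwise, for each $j \ge 1$, denote by $\bar\theta_j$ the maximum of $\theta$ on the block $B^j = \{(j-1)n_0 + 1, \dots, jn_0\}$. For all $t \in B^j$ we have:

$$\bar\theta_j \ge \theta_t \ge \bar\theta_j - \sum_{t' \in \{(j-1)n_0+1, \dots, jn_0-1\}} |\theta_{t'+1} - \theta_{t'}|.$$

As a consequence, for all $j$ we have:

$$\sum_{t=(j-1)n_0+1}^{jn_0} \theta_t \, r(z'_t) \ \ge\ \bar\theta_j \sum_{t=(j-1)n_0+1}^{jn_0} r(z'_t) \ -\ n_0 \sum_{t' \in \{(j-1)n_0+1, \dots, jn_0-1\}} |\theta_{t'+1} - \theta_{t'}|$$

$$\ge\ \sum_{t=(j-1)n_0+1}^{jn_0} \theta_t \, (v^*(p) - 2\varepsilon) \ -\ n_0 \sum_{t' \in \{(j-1)n_0+1, \dots, jn_0-1\}} |\theta_{t'+1} - \theta_{t'}|,$$

and by summing over $j$, we get: $\gamma_\theta(p, \sigma') \ge v^*(p) - 2\varepsilon - n_0 I(\theta) \ge v^*(p) - 3\varepsilon$ as soon as $I(\theta)$ is small enough. □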
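The block argument closing this proof can be illustrated on a toy payoff stream. The sketch below assumes a finite list of stage payoffs in $[0,1]$ whose average over each block of $n_0$ consecutive stages is at least some floor $c \ge 0$; all names (`theta_payoff`, `block_bound`, `cesaro`, `discounted`) are hypothetical. It compares the $\theta$-payoff with the lower bound $c - n_0 I(\theta)$ obtained by summing the block inequalities.

```python
def impatience(theta):
    """I(theta) = sum over t >= 1 of |theta_{t+1} - theta_t| (zero tail)."""
    padded = list(theta) + [0.0]
    return sum(abs(padded[t + 1] - padded[t]) for t in range(len(theta)))


def theta_payoff(theta, rewards):
    """Expected payoff sum_t theta_t * r_t of a deterministic reward stream."""
    if len(rewards) < len(theta):
        raise ValueError("reward stream shorter than the support of theta")
    return sum(w * r for w, r in zip(theta, rewards))


def block_bound(theta, floor, n0):
    """Lower bound floor - n0 * I(theta) given by the block argument."""
    return floor - n0 * impatience(theta)


def cesaro(n):
    """Uniform evaluation on the first n stages."""
    return [1.0 / n] * n


def discounted(lam, horizon):
    """Discounted weights lam * (1 - lam)^(t-1), truncated and renormalised."""
    raw = [lam * (1.0 - lam) ** t for t in range(horizon)]
    total = sum(raw)
    return [w / total for w in raw]
```

With the stream `[1, 0, 1, 0, ...]`, `n0 = 4` and `floor = 0.5`, the bound is informative only for patient evaluations, for example `cesaro(n)` with `n` much larger than `n0`.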

3.4 Proof of theorem 3.19

Assume that $X$ is a compact subset of a simplex $\Delta(K)$, and let $\Psi = (X, A, q, g)$ be a standard MDP such that: $\forall x \in X$, $\forall y \in X$, $\forall a \in A$, $\forall f \in D_1$, $\forall \alpha \ge 0$, $\forall \beta \ge 0$,

$$|\alpha f(q(x,a)) - \beta f(q(y,a))| \le \|\alpha x - \beta y\|_1 \quad \text{and} \quad |\alpha g(x,a) - \beta g(y,a)| \le \|\alpha x - \beta y\|_1.$$

We write $Z = \Delta_f(X) \times [0,1]$, and $\bar{Z} = \Delta(X) \times [0,1]$. We will use the metric $d_* = d_0 = d_1 = d_2 = d_3$ on $\Delta(\Delta(K))$ introduced in section 2.3 and its restriction to $\Delta(X)$, so that $\bar{Z}$ is a compact metric space. For all $(u,y), (u',y') \in \Delta_f(X) \times [0,1]$, we put $d((u,y),(u',y')) = \max\{d_*(u,u'), |y - y'|\}$ so that $(Z,d)$ is a precompact metric space. Recall that we have defined the correspondence $F$ from $Z$ to itself such that for all $(u,y)$ in $Z$,

$$F(u,y) = \{(Q(u,\sigma), G(u,\sigma)) \ \text{s.t.} \ \sigma : X \to \Delta_f(A)\},$$

with the notations $Q(u,\sigma) = \sum_{x \in X} u(x) q(x,\sigma(x))$ and $G(u,\sigma) = \sum_{x \in X} u(x) g(x,\sigma(x))$. We simply define the payoff function $r$ from $Z$ to $[0,1]$ by $r(u,y) = y$ for all $(u,y)$ in $Z$. We start with a crucial lemma, which shows the importance of the duality formula of theorem 2.18.

Lemma 3.26. F is an affine and non expansive correspondence from Z to itself. 

Proof of lemma 3.26: We first show that: $\forall u, u' \in \Delta_f(X)$, $\forall \alpha \in [0,1]$, $\forall y, y' \in [0,1]$, $F(\alpha u + (1-\alpha)u', \alpha y + (1-\alpha)y') = \alpha F(u,y) + (1-\alpha)F(u',y')$. The transition does not depend on the second coordinate, so we can forget it for the rest of the proof. The $\subset$ part is clear. To see the reverse inclusion, consider $\sigma : X \to \Delta_f(A)$, $\sigma' : X \to \Delta_f(A)$ and $v = \alpha \sum_{x \in X} u(x) q(x,\sigma(x)) + (1-\alpha) \sum_{x \in X} u'(x) q(x,\sigma'(x))$ in $\alpha F(u) + (1-\alpha)F(u')$. Define

$$\sigma^*(x) = \frac{\alpha u(x) \sigma(x) + (1-\alpha) u'(x) \sigma'(x)}{\alpha u(x) + (1-\alpha) u'(x)}$$

for each $x$ such that the denominator is positive. Then $v = \sum_{x \in X} (\alpha u(x) + (1-\alpha) u'(x)) \, q(x, \sigma^*(x))$, and $F$ is affine.
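The merged policy $\sigma^*$ can be checked on a small instance. The sketch below is a simplification made for illustration: it assumes a finite state set indexed by `0..n-1` and a transition `q[x][a]` given as a probability vector over next states, whereas in the paper $X \subset \Delta(K)$ and $q(x,a) \in \Delta_f(X)$. The names `transition` and `merged_policy` are hypothetical.

```python
def transition(u, sigma, q):
    """Q(u, sigma)[y] = sum_x u[x] * sum_a sigma[x][a] * q[x][a][y]."""
    n = len(u)
    out = [0.0] * n
    for x in range(n):
        for a, prob_a in enumerate(sigma[x]):
            for y in range(n):
                out[y] += u[x] * prob_a * q[x][a][y]
    return out


def merged_policy(alpha, u, sigma, u2, sigma2):
    """sigma*(x) = (alpha u(x) sigma(x) + (1-alpha) u'(x) sigma'(x)) / denominator."""
    merged = []
    for x in range(len(u)):
        denom = alpha * u[x] + (1.0 - alpha) * u2[x]
        if denom > 0:
            merged.append([
                (alpha * u[x] * s + (1.0 - alpha) * u2[x] * s2) / denom
                for s, s2 in zip(sigma[x], sigma2[x])
            ])
        else:
            # x carries no mass under the mixture, so the choice is irrelevant
            merged.append(list(sigma[x]))
    return merged
```

Applying `transition` to the mixture `alpha*u + (1-alpha)*u2` with `merged_policy(...)` returns the same vector as the convex combination of the two individual transitions, which is the reverse inclusion used in the proof.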

We now prove that $F$ is non expansive. Let $z = (u,y)$ and $z' = (u',y')$ be in $Z$. We have $d((u,y),(u',y')) \ge d_*(u,u')$, and we denote by $U$ and $U'$ the respective supports of $u$ and $u'$. By the duality formula of theorem 2.18 there exist $\alpha = (\alpha(p,p'))_{(p,p') \in U \times U'}$ and $\beta = (\beta(p,p'))_{(p,p') \in U \times U'}$ with non-negative coordinates satisfying: $\sum_{p' \in U'} \alpha(p,p') = u(p)$ for all $p \in U$, $\sum_{p \in U} \beta(p,p') = u'(p')$ for all $p' \in U'$, and

$$d_*(u,u') = \sum_{(p,p') \in U \times U'} \|p \, \alpha(p,p') - p' \, \beta(p,p')\|_1.$$

Consider now $v = Q(u,\sigma) = \sum_{p \in U} u(p) q(p,\sigma(p))$ for some $\sigma : X \to \Delta_f(A)$. We define for all $p'$ in $U'$:

$$\sigma'(p') = \sum_{p \in U} \frac{\beta(p,p')}{u'(p')} \, \sigma(p),$$

and $v' = Q(u',\sigma') = \sum_{p' \in U'} u'(p') q(p',\sigma'(p'))$. Then $v' \in F(u',y')$, and for each test function $\varphi$ in $D_1$ we have:

$$|\varphi(v) - \varphi(v')| = \Big| \sum_{p,p'} \alpha(p,p') \varphi(q(p,\sigma(p))) - \beta(p,p') \varphi(q(p',\sigma(p))) \Big|$$

$$= \Big| \sum_{p,p',a} \alpha(p,p') \sigma(p)(a) \varphi(q(p,a)) - \beta(p,p') \sigma(p)(a) \varphi(q(p',a)) \Big|$$

$$\le \sum_{p,p'} \|\alpha(p,p') \, p - \beta(p,p') \, p'\|_1 = d_*(u,u'),$$

and therefore $d_*(v,v') \le d_*(u,u')$. In addition we have a similar result on the payoff,

$$|G(u,\sigma) - G(u',\sigma')| = \Big| \sum_{p,p'} \alpha(p,p') g(p,\sigma(p)) - \beta(p,p') g(p',\sigma(p)) \Big| \le \sum_{p,p'} \|\alpha(p,p') \, p - \beta(p,p') \, p'\|_1 \le d_*(u,u').$$

Thus we have $d((Q(u,\sigma), G(u,\sigma)), (Q(u',\sigma'), G(u',\sigma'))) \le d_*(u,u') \le d(z,z')$. □
Recall that the set of invariant couples of the MDP is:

$$RR = \{(u,y) \in \bar{Z}, \ ((u,y),(u,y)) \in \mathrm{cl}(\mathrm{Graph}(F))\},$$

and the function $v^* : X \to \mathbb{R}$ is defined by:

$$v^*(x) = \inf \Big\{ w(x), \ w : \Delta(X) \to [0,1] \ \text{affine } C^0 \ \text{s.t.} \ (1) \ \forall y \in X, \ w(y) \ge \sup_{a \in A} w(q(y,a)) \ \text{and} \ (2) \ \forall (u,y) \in RR, \ w(u) \ge y \Big\}.$$

We now consider the deterministic Gambling House $\Gamma = (Z,F,r)$. $Z$ is precompact metric, $F$ is affine non expansive and $r$ is obviously affine and uniformly continuous. Given an evaluation $\theta$, the $\theta$-value of $\Gamma$ at $z_0 = (u,y)$ is denoted by $v_\theta(u,y) = v_\theta(u)$ and does not depend on $y$. The recursive formula of section 3.1 yields:

$$\forall (u,y) \in Z, \quad v_\theta(u) = \sup_{(u',y') \in F(u)} \big( \theta_1 y' + (1-\theta_1) v_{\theta^+}(u') \big) = \sup_{\sigma : X \to \Delta_f(A)} \big( \theta_1 G(u,\sigma) + (1-\theta_1) v_{\theta^+}(Q(u,\sigma)) \big).$$

Because $F$ and $r$ are affine, $v_\theta$ is affine in $u$ and the supremum in the above expression can be taken over the functions from $X$ to $A$. Because $F$ is non expansive and $r$ is 1-Lipschitz, each $v_\theta$ is 1-Lipschitz.

We denote by $\hat{v}_\theta$ the $\theta$-value of the MDP $\Psi$ and linearly extend it to $\Delta_f(X)$. It turns out that the recursive formula satisfied by $\hat{v}_\theta$ is similar to the above recursive formula for $v_\theta$, so that $\hat{v}_\theta(u) = v_\theta(u,y)$ for all $u$ in $\Delta_f(X)$ and $y$ in $[0,1]$. As a consequence, the existence of the general limit value in both problems $\Gamma$ and $\Psi$ is equivalent. Moreover, a deterministic play in $\Gamma$ induces a strategy in $\Psi$, so that the existence of a general uniform value in $\Gamma$ implies the existence of the general uniform value in $\Psi$ (note that deterministic and mixed plays in $\Gamma$ are equivalent since $F$ has convex values).

It is thus sufficient to show that $\Gamma$ has a general uniform value given by $v^*$, and we can mimic the end of the proof of theorem 3.9. Lemma 3.23 applies word for word. Finally, one can proceed almost exactly as in propositions 3.24 and 3.25 to show that $\Gamma$, hence $\Psi$, has a general uniform value given by $v^*$.

4 Applications to partial observation and games 

4.1 POMDP with finitely many states 

We now consider a more general model of MDP with actions where after each stage, the decision maker does not perfectly observe the state. An MDP with partial observation, or POMDP, $\Gamma = (K, A, S, q, g)$ is given by a finite set of states $K$, a non empty set of actions $A$ and a non empty set of signals $S$. The transition $q$ now goes from $K \times A$ to $\Delta_f(S \times K)$ (by assumption the support of the signals at each state is finite) and the payoff function $g$ still goes from $K \times A$ to $[0,1]$. Given an initial probability $p_1$ on $K$, the POMDP $\Gamma(p_1)$ is played as follows. An initial state $k_1$ in $K$ is selected according to $p_1$ and is not told to the decision maker. At every stage $t$ he selects an action $a_t \in A$. He has an (unobserved) payoff $g(k_t, a_t)$ and a pair $(s_t, k_{t+1})$ is drawn according to $q(k_t, a_t)$. The player learns $s_t$, and the play proceeds to stage $t+1$ with the new state $k_{t+1}$. A behavioral strategy is now a sequence $(\sigma_t)_{t \ge 1}$ of mappings with, for each $t$, $\sigma_t : (A \times S)^{t-1} \to \Delta_f(A)$. As usual, an initial probability on $K$ and a behavior strategy $\sigma$ induce a probability distribution over $(K \times A \times S)^\infty$ and we can define the $\theta$-values and the notions of general limit and uniform values accordingly.

Theorem 4.1. A POMDP with finitely many states has a general uniform value, i.e. there exists $v^* : \Delta(K) \to \mathbb{R}$ with the following property: for each $\varepsilon > 0$ one can find $\alpha > 0$ and for each initial probability $p$ a behavior strategy $\sigma(p)$ such that for each evaluation $\theta$ with $I(\theta) \le \alpha$,

$$\forall p \in \Delta(K), \quad |v_\theta(p) - v^*(p)| \le \varepsilon \quad \text{and} \quad \gamma_\theta(\sigma(p)) \ge v^*(p) - \varepsilon.$$

Proof: We introduce $\Psi$, an auxiliary MDP on $X = \Delta(K)$ with the same set of actions $A$ and the following payoff and transition functions:

• $r : X \times A \to [0,1]$ such that $r(p,a) = \sum_{k \in K} p(k) g(k,a)$ for all $p$ in $X$ and $a \in A$,

• $\hat{q} : X \times A \to \Delta_f(X)$ such that

$$\hat{q}(p,a) = \sum_{s \in S} q(p,a)(s) \, \delta_{\hat{q}(p,a,s)},$$

where $\hat{q}(p,a,s) \in \Delta(K)$ is the belief on the new state after playing $a$ at $p$ and observing the signal $s$:

$$\forall k' \in K, \quad \hat{q}(p,a,s)(k') = \frac{q(p,a)(s,k')}{q(p,a)(s)} = \frac{\sum_k p^k q(k,a)(s,k')}{\sum_k p^k q(k,a)(s)}.$$

The POMDP $\Gamma(p_1)$ and the standard MDP $\Psi(p_1)$ have the same value for all $\theta$-evaluations. And for each strategy $\sigma$ in $\Psi(p_1)$, the player can guarantee the same payoff in the original game $\Gamma(p_1)$ by mimicking the strategy $\sigma$. So if we prove that $\Psi$ has a general uniform value, it will imply that the POMDP $\Gamma$ has a general uniform value.
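The transition of the auxiliary MDP is a Bayesian belief update. The sketch below is illustrative only and assumes finite sets encoded as Python dictionaries: `p` maps states to probabilities, and `q[(k, a)]` maps pairs `(s, k_next)` to probabilities. The function name `belief_transition` is hypothetical. It returns, for each signal of positive probability, the probability of that signal and the posterior belief on the new state.

```python
def belief_transition(p, q, a):
    """Return {signal: (probability, posterior belief)} after playing a at belief p."""
    # Joint law of (signal, next state) when the state is drawn from p.
    joint = {}
    for k, pk in p.items():
        for (s, k_next), prob in q[(k, a)].items():
            joint[(s, k_next)] = joint.get((s, k_next), 0.0) + pk * prob
    # Marginal law of the signal.
    signal_prob = {}
    for (s, _), mass in joint.items():
        signal_prob[s] = signal_prob.get(s, 0.0) + mass
    # Conditional law of the next state given each signal (Bayes rule).
    result = {}
    for s, ps in signal_prob.items():
        if ps > 0:
            posterior = {k: joint.get((s, k), 0.0) / ps for k in p}
            result[s] = (ps, posterior)
    return result
```

When there is a single signal, the returned dictionary has one entry of probability 1, so the belief evolves deterministically.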

To conclude the proof, we simply apply theorem 3.19 to the MDP $\Psi$. We need to check the assumptions on the payoff and on the transition.

Consider any $p, p'$ in $X$, $a \in A$, $\alpha \ge 0$ and $\beta \ge 0$. We have:

$$|\alpha r(p,a) - \beta r(p',a)| = \Big| \sum_k (\alpha p(k) - \beta p'(k)) g(k,a) \Big| \le \|\alpha p - \beta p'\|_1.$$

Moreover for any $f \in D_1$, we have

$$|\alpha \hat{q}(p,a)(f) - \beta \hat{q}(p',a)(f)| = \Big| \sum_s \big( \alpha q(p,a)(s) f(\hat{q}(p,a,s)) - \beta q(p',a)(s) f(\hat{q}(p',a,s)) \big) \Big|$$

$$\le \sum_{s,k} \Big| \sum_{k'} \alpha p(k') q(k',a)(s,k) - \beta p'(k') q(k',a)(s,k) \Big|$$

$$\le \sum_{s,k,k'} q(k',a)(s,k) \, |\alpha p(k') - \beta p'(k')| = \|\alpha p - \beta p'\|_1,$$

where the first inequality comes from the definition of $D_1$.

By theorem 3.19 the MDP $\Psi$ has a general uniform value and we deduce that the POMDP $\Gamma$ has a general uniform value. □

Example 4.2. Let $\Gamma = (K, A, S, q, g, p_1)$ be a POMDP where $K = \{k_1, k_2\}$, $A = \{a,b\}$, $S = \{s\}$ and $p_1 = \delta_{k_1}$. The initial state is $k_1$ and since there is only one signal, the decision maker will obtain no additional information on the state. We say that he is in the dark. Writing $0$ for the state $k_1$ and $1$ for the state $k_2$, the payoff is given by $g(0,a) = g(0,b) = g(1,b) = 0$ and $g(1,a) = 1$, and the transition by $q(1,a) = q(1,b) = \delta_{1,s}$, $q(0,a) = \delta_{0,s}$ and $q(0,b) = \frac{1}{2}\delta_{0,s} + \frac{1}{2}\delta_{1,s}$. On the one hand, if the decision maker plays $a$ then the state stays the same and he receives a payoff of 1 if and only if the state is 1; on the other hand, if he plays $b$ then he receives a payoff of 0 but the probability of being in state 1 increases.

We define the function $r$ from $X \times A$, with $X = \Delta(K)$, to $[0,1]$ by $r((p,1-p),a) = 1-p$ and $r((p,1-p),b) = 0$ for all $p \in [0,1]$, and the function $\hat{q}$ from $X \times A$ to $\Delta_f(X)$ by

$$\hat{q}((p,1-p),a) = \delta_{(p,1-p)} \quad \text{and} \quad \hat{q}((p,1-p),b) = \delta_{(p/2,\,1-p/2)}.$$

Then the standard MDP $\Psi = (\Delta(K), A, r, \hat{q})$ is the MDP associated with $\Gamma$ in the previous proof. This MDP is deterministic since the decision maker is in the dark.

In this example, the existence of a general uniform value is immediate. If we fix $n \in \mathbb{N}$, the strategy $\sigma = b^n a^\infty$, which plays $n$ times $b$ and then $a$ for the rest of the game, guarantees a stage payoff of $(1 - \frac{1}{2^n})$ from stage $n+1$ on, so the game has a general uniform value equal to 1. Finally, if we consider the discounted evaluations, one can show that the speed of convergence of $v_\lambda$ is slower than $\lambda$:

$$v_\lambda(p_1) = 1 - \frac{\ln(1/\lambda)}{\ln 2} \, \lambda + O(\lambda).$$

All the spaces are finite, but the partial observation implies that the speed of convergence is slower than $\lambda$, contrary to the perfect observation case where it is well known that the convergence is in $O(\lambda)$.
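The rate of convergence in this example can be explored numerically. The sketch below is a rough check rather than a proof: it assumes the discounted weights $\theta_t = \lambda(1-\lambda)^{t-1}$ and only considers strategies of the form $b^n a^\infty$, whose discounted payoff is $(1-\lambda)^n (1 - 2^{-n})$, so it computes a lower bound on $v_\lambda(p_1)$. The names `payoff` and `best_switch` are hypothetical.

```python
import math


def payoff(n, lam):
    """Discounted payoff of b^n a^infinity: stages after n weigh (1 - lam)^n in total."""
    return (1.0 - lam) ** n * (1.0 - 0.5 ** n)


def best_switch(lam, n_max=2000):
    """Best number of initial b's among 0..n_max, with the corresponding payoff."""
    best_n = max(range(n_max + 1), key=lambda n: payoff(n, lam))
    return best_n, payoff(best_n, lam)


def normalised_gap(lam):
    """(1 - payoff) / lam minus log2(1 / lam); stays bounded if the gap is of order lam * ln(1/lam)."""
    _, value = best_switch(lam)
    return (1.0 - value) / lam - math.log2(1.0 / lam)
```

For small `lam` the quantity `normalised_gap(lam)` remains bounded while `(1 - payoff) / lam` itself grows like `log2(1 / lam)`, consistent with a convergence slower than $\lambda$.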
Remark 4.3. It is unknown whether the uniform value exists in pure strategies, i.e. whether the behavior strategies $\sigma(p)$ of theorem 4.1 can be chosen with values in $A$. This was already an open problem for the Cesaro-uniform value (see Rosenberg et al. 2002 and Renault 2011 for different proofs requiring the use of behavioral strategies). In our proof, there are two related places where the use of lotteries on actions is important. First, in the proof of the convergence of the function $h_{T,n}$ (within the proof of theorem 3.9), we used Sion's theorem in order to exchange a supremum and an infimum, so we need the convexity of the set of strategies. Secondly, when we prove that the extended transition is 1-Lipschitz (see lemma 3.26), the coupling between the two distributions $u$ and $u'$ introduces some randomization.

4.2 Zero-sum repeated games with an informed controller 

We finally consider zero-sum repeated games with an informed controller. We start with a general model $\Gamma = (K, I, J, C, D, q, g)$ of zero-sum repeated game, where we have 5 non empty finite sets: a set of states $K$, two sets of actions $I$ and $J$ and two sets of signals $C$ and $D$, and we also have a transition mapping $q$ from $K \times I \times J$ to $\Delta(K \times C \times D)$ and a payoff function $g$ from $K \times I \times J$ to $[0,1]$. Given an initial probability $\pi$ in $\Delta(K \times C \times D)$, the game $\Gamma(\pi) = \Gamma(K, I, J, C, D, q, g, \pi)$ is played as follows: at stage 1, a triple $(k_1, c_1, d_1)$ is drawn according to $\pi$, player 1 learns $c_1$ and player 2 learns $d_1$. Then simultaneously player 1 chooses an action $i_1$ in $I$ and player 2 chooses an action $j_1$ in $J$. Player 1 gets an (unobserved) payoff $g(k_1, i_1, j_1)$ and player 2 the opposite. Then a new triple $(k_2, c_2, d_2)$ is drawn according to $q(k_1, i_1, j_1)$. Player 1 observes $c_2$, player 2 observes $d_2$ and the game proceeds to the next stage, etc.

A (behavioral) strategy for player 1 is a sequence $\sigma = (\sigma_t)_{t \ge 1}$ where for each $t \ge 1$, $\sigma_t$ is a mapping from $(C \times I)^{t-1} \times C$ to $\Delta(I)$. Similarly a strategy for player 2 is a sequence of mappings $\tau = (\tau_t)_{t \ge 1}$ where for each $t \ge 1$, $\tau_t$ is a mapping from $(D \times J)^{t-1} \times D$ to $\Delta(J)$. We denote respectively by $\Sigma$ and $\mathcal{T}$ the sets of strategies of player 1 and player 2. An initial distribution $\pi$ and a couple of strategies $(\sigma, \tau)$ define for each $t$ a probability on the possible histories up to stage $t$. By the Kolmogorov extension theorem, it can be uniquely extended to a probability $\mathbb{P}_{\pi,\sigma,\tau}$ on the set of infinite histories $(K \times C \times D \times I \times J)^{\infty}$.

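Given an evaluation function $\theta$, we define the $\theta$-payoff of $(\sigma, \tau)$ in $\Gamma(\pi)$ as the expectation under $\mathbb{P}_{\pi,\sigma,\tau}$ of the payoff function,

$$\gamma_\theta(\pi, \sigma, \tau) = \mathbb{E}_{\pi,\sigma,\tau} \Big( \sum_{t \ge 1} \theta_t \, g(k_t, i_t, j_t) \Big),$$

and we can define the general limit value as in the MDP framework. Note that we do not ask the convergence to be uniform for all $\pi$ in $\Delta(K \times C \times D)$, because we will later make some assumptions, in particular on the initial distribution.

By Sion's theorem the game $\Gamma_\theta(\pi)$ has a value:

$$v_\theta(\pi) = \max_{\sigma \in \Sigma} \min_{\tau \in \mathcal{T}} \gamma_\theta(\pi, \sigma, \tau) = \min_{\tau \in \mathcal{T}} \max_{\sigma \in \Sigma} \gamma_\theta(\pi, \sigma, \tau).$$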
Definition 4.4. The repeated game $\Gamma(\pi) = (K, I, J, C, D, q, g, \pi)$ has a general limit value $v^*(\pi)$ if $v_\theta(\pi)$ converges to $v^*(\pi)$ when $I(\theta)$ goes to zero, i.e.:

$$\forall \varepsilon > 0, \ \exists \alpha > 0, \ \forall \theta, \quad \big( I(\theta) \le \alpha \implies |v_\theta(\pi) - v^*(\pi)| \le \varepsilon \big).$$

We can define a general uniform value by symmetrizing the definition for MDP.

Definition 4.5. The repeated game $\Gamma(\pi)$ has a general uniform value if it has a general limit value $v^*$ and for each $\varepsilon > 0$ one can find $\alpha > 0$ and a couple of strategies $\sigma^*$ and $\tau^*$ such that for all evaluations $\theta$ with $I(\theta) \le \alpha$:

$$\forall \tau \in \mathcal{T}, \ \gamma_\theta(\pi, \sigma^*, \tau) \ge v^*(\pi) - \varepsilon \quad \text{and} \quad \forall \sigma \in \Sigma, \ \gamma_\theta(\pi, \sigma, \tau^*) \le v^*(\pi) + \varepsilon.$$

We now focus on the case of a repeated game with an informed controller. We follow the definitions introduced in Renault (2012a). The first one concerns the information of the first player. We assume that he is always fully informed of the state and of the signal of the second player:

Assumption 4.6. There exist two mappings $\hat{k} : C \to K$ and $\hat{d} : C \to D$ such that, if $E$ denotes $\{(k,c,d) \in K \times C \times D, \ \hat{k}(c) = k, \ \hat{d}(c) = d\}$, we have: $\forall (k,i,j) \in K \times I \times J$, $q(k,i,j)(E) = 1$, and $\pi(E) = 1$.

Moreover we will assume that only player 1 has a meaningful influence on the transitions, in the following sense.

Assumption 4.7. The marginal of the transition on $K \times D$ is not influenced by player 2's action. For $k$ in $K$, $i$ in $I$ and $j$ in $J$, we denote by $\bar{q}(k,i)$ the marginal of $q(k,i,j)$ on $K \times D$.

The second player may influence the signal of the first player but he cannot prevent player 1 from learning the state or from learning player 2's signal. Moreover he cannot influence his own information, thus he has no influence on his beliefs about the state or on the beliefs of player 1 about his beliefs. A repeated game satisfying assumptions 4.6 and 4.7 is called a repeated game with an informed controller. It was proved in Renault (2012a) that for such games the Cesaro-uniform value exists, and we generalize this result here to the general uniform value.

Example 4.8. We consider $\Gamma$ a zero-sum repeated game with incomplete information as studied by Aumann and Maschler (1995). It is defined by a finite family $(G^k)_{k \in K}$ of payoff matrices in $[0,1]^{I \times J}$ and an initial probability $p \in \Delta(K)$. At the first stage, some state $k$ is selected according to $p$ and told to player 1 only. The second player knows the initial distribution $p$ but not the realization. Then the matrix game $G^k$ is repeated over and over. At each stage the players observe past actions but not their payoff. Formally it is a zero-sum repeated game $\Gamma = (K, I, J, C, D, q, g)$ as defined previously, with $C = K \times I \times J$ and $D = I \times J$, and for all $(k,i,j) \in K \times I \times J$, $g(k,i,j) = G^k(i,j)$ and $q(k,i,j) = \delta_{(k,(k,i,j),(i,j))}$. For all $p \in \Delta(K)$, we denote by $\Gamma(p)$ the game where the initial probability $\pi \in \Delta(K \times C \times D)$ is given by $\pi = \sum_{k \in K} p(k) \, \delta_{(k,(k,i_0,j_0),(i_0,j_0))}$ with $(i_0, j_0) \in I \times J$ fixed.

For each $n$, we denote by $v_n(p)$ the value of the $n$-stage game with initial probability $p$, where the payoff is the expected average of the first $n$ stages. It is known that it satisfies the following dynamic programming formula:

$$v_n(p) = \sup_{a \in \Delta(I)^K} \Big( \frac{1}{n} r(p,a) + \frac{n-1}{n} \sum_{(k,i) \in K \times I} p^k a^k(i) \, v_{n-1}(\hat{q}(p,a,i)) \Big),$$

where $p \in \Delta(K)$, $r(p,a) = \min_j \big( \sum_k p^k G^k(a^k, j) \big)$ and $\hat{q}(p,a,i)$ is the conditional belief on $\Delta(K)$ given $p$, $a$, $i$:

$$\hat{q}(p,a,i)(k) = \frac{p^k a^k(i)}{\sum_{k' \in K} p^{k'} a^{k'}(i)}.$$

Starting from a belief $p$ about the state, if player 2 observes action $i$ and knows that the distribution of actions of player 1 is $a$, then he updates his beliefs to $\hat{q}(p,a,i)$. Aumann and Maschler have proved that the limit value exists and is characterized by

$$v^* = \mathrm{cav} f^* = \inf\{v : \Delta(K) \to [0,1], \ v \text{ concave}, \ v \ge f^*\},$$

where $f^*(p) = \mathrm{Val}\big(\sum_k p^k G^k\big)$ for all $p \in \Delta(K)$. The function $f^*$ is the value of the game, called the non-revealing game, where player 1 is forbidden to use his information.
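The belief update of player 2 in this example is a direct application of Bayes' rule. The sketch below is illustrative; it assumes `p` is a list of prior probabilities indexed by states and `a[k]` is the mixed action of player 1 in state `k`, given as a list indexed by actions. The names `action_probability` and `posterior` are hypothetical.

```python
def action_probability(p, a, i):
    """Total probability that player 1 plays action i under prior p and strategy a."""
    return sum(p[k] * a[k][i] for k in range(len(p)))


def posterior(p, a, i):
    """Belief of player 2 on the state after observing action i (None if i has probability 0)."""
    total = action_probability(p, a, i)
    if total == 0:
        return None
    return [p[k] * a[k][i] / total for k in range(len(p))]
```

If `a[k]` does not depend on `k` (a non-revealing strategy), the posterior equals the prior; if the supports of the `a[k]` are disjoint, the observed action reveals the state. In all cases the posteriors average back to the prior.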

Theorem 4.9. A zero-sum repeated game with an informed controller has a general uniform value.



Proof of theorem 4.9: Assume that $\Gamma(\pi) = (K, I, J, C, D, q, g, \pi)$ is a repeated game with an informed controller. The proof consists of 5 steps. First we introduce an auxiliary standard Markov Decision Process $\Psi(\hat\pi)$ on the state space $X = \Delta(K)$. Then we show that for all evaluations $\theta$, the repeated game $\Gamma(\pi)$ and the MDP $\Psi(\hat\pi)$ have the same $\theta$-value. In step 3 we check that the MDP satisfies the assumption of theorem 3.19, so it has a general limit value and a general uniform value $v^*$. As a consequence the repeated game has a general limit value $v^*(\hat\pi)$. Then we prove that player 1 can use an $\varepsilon$-optimal strategy of the auxiliary MDP in order to guarantee $v^*(\hat\pi) - \varepsilon$ in the original game. Finally we prove that player 2 can play by blocks in the repeated game in order to guarantee $v^*(\hat\pi) + \varepsilon$. We obtain that $v^*(\hat\pi)$ can be guaranteed by both players in the repeated game, so it is the general uniform value of $\Gamma(\pi)$.

For every $P \in \Delta(K \times C \times D)$, we denote by $\bar{P}$ the marginal of $P$ on $K \times D$ and we put $\hat{P} = \psi_D(\bar{P})$, where $\psi_D$ is the disintegration with respect to $D$ (recall proposition 2.21): for all $\mu \in \Delta(K \times D)$, $\psi_D(\mu) = \sum_{d \in D} \mu(d) \, \delta_{\mu(\cdot|d)}$.

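Step 1: We put $X = \Delta(K)$ and $A = \Delta(I)^K$, and for every $p$ in $X$, $a$ in $A$ and $b$ in $\Delta(J)$, we define:

$$r(p,a,b) = \sum_{(k,i,j) \in K \times I \times J} p^k a^k(i) b(j) \, g(k,i,j) \in [0,1],$$

$$r(p,a) = \inf_{b \in \Delta(J)} r(p,a,b) = \inf_{j \in J} r(p,a,j),$$

$$\bar{q}(p,a) = \sum_{(k,i) \in K \times I} p^k a^k(i) \, \bar{q}(k,i) \in \Delta(K \times D),$$

$$\hat{q}(p,a) = \psi_D(\bar{q}(p,a)) = \sum_{d \in D} \bar{q}(p,a)(d) \, \delta_{\hat{q}(p,a,d)} \in \Delta_f(X).$$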

Here $\hat{q}(p,a,d) \in \Delta(K)$ is the belief of the second player on the new state after observing the signal $d$ and knowing that player 1 has played $a$ at $p$:

$$\forall k' \in K, \quad \hat{q}(p,a,d)(k') = \frac{\bar{q}(p,a)(k',d)}{\bar{q}(p,a)(d)} = \frac{\sum_k p^k \bar{q}(k,a^k)(k',d)}{\sum_k p^k \bar{q}(k,a^k)(d)}.$$

We define the auxiliary MDP $\Psi = (X, A, \hat{q}, r)$, and denote the $\theta$-value in the MDP by $\hat{v}_\theta$. The MDP with initial state $\hat\pi$ has strong links with the repeated game $\Gamma(\pi)$.
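The identity $\inf_{b \in \Delta(J)} r(p,a,b) = \inf_{j \in J} r(p,a,j)$ used in Step 1 holds because $r(p,a,\cdot)$ is linear in $b$, so its minimum over the simplex is reached at a pure action. The following sketch illustrates this on finite data; the encoding (`g[k][i][j]` for the payoff, `a[k]` for the mixed action in state `k`) and the function names are assumptions made for the example.

```python
def stage_payoff(p, a, b, g):
    """r(p, a, b) = sum over (k, i, j) of p[k] * a[k][i] * b[j] * g[k][i][j]."""
    return sum(
        p[k] * a[k][i] * b[j] * g[k][i][j]
        for k in range(len(p))
        for i in range(len(a[k]))
        for j in range(len(b))
    )


def worst_case_payoff(p, a, g):
    """r(p, a): minimum of r(p, a, j) over the pure actions j of player 2."""
    n_actions = len(g[0][0])
    payoffs = []
    for j in range(n_actions):
        pure = [1.0 if jj == j else 0.0 for jj in range(n_actions)]
        payoffs.append(stage_payoff(p, a, pure, g))
    return min(payoffs)
```

Any mixed action `b` of player 2 yields a payoff that is a convex combination of the pure-action payoffs, hence at least `worst_case_payoff(p, a, g)`.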

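Step 2: By proposition 4.23, part b) in Renault (2012a), we have for all evaluations $\theta$ with finite support:

$$v_\theta(\pi) = \hat{v}_\theta(\hat\pi),$$

where $\hat{v}_\theta$ denotes the $\theta$-value of the MDP. The proof relies on the same recursive formula satisfied by $v$ and $\hat{v}$, and the equality can be easily extended to any evaluation $\theta$:

$$\forall \theta \in \Delta(\mathbb{N}^*), \ \forall p \in X, \quad \hat{v}_\theta(p) = \sup_{a \in A} \inf_{b \in \Delta(J)} \big( \theta_1 r(p,a,b) + (1-\theta_1) \hat{v}_{\theta^+}(\hat{q}(p,a)) \big),$$

where $\hat{v}_{\theta^+}$ is naturally linearly extended to $\Delta_f(X)$. As a consequence, if $\Psi(\hat\pi)$ has a general limit value, so does the repeated game $\Gamma(\pi)$.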

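Step 3: Let us check that $\Psi$ satisfies the assumption of theorem 3.19. Consider $p, p'$ in $X$, $a$ in $A$, and $\alpha \ge 0$ and $\beta \ge 0$. We have:

$$|\alpha r(p,a) - \beta r(p',a)| \le \sup_{b \in \Delta(J)} |\alpha r(p,a,b) - \beta r(p',a,b)|$$

$$\le \sup_{b \in \Delta(J)} \Big| \sum_k \alpha p^k g(k,a^k,b) - \beta p'^k g(k,a^k,b) \Big|$$

$$\le \sup_{b \in \Delta(J)} \sum_k |\alpha p^k - \beta p'^k| = \|\alpha p - \beta p'\|_1.$$

Moreover, let $\varphi : \Delta(K) \to \mathbb{R}$ be in $D_1$. We have:

$$|\alpha \varphi(\hat{q}(p,a)) - \beta \varphi(\hat{q}(p',a))| = \Big| \sum_{d \in D} \alpha \bar{q}(p,a)(d) \varphi(\hat{q}(p,a,d)) - \beta \bar{q}(p',a)(d) \varphi(\hat{q}(p',a,d)) \Big|$$

$$\le \sum_{d \in D} \big\| \alpha (\bar{q}(p,a)(k',d))_{k'} - \beta (\bar{q}(p',a)(k',d))_{k'} \big\|_1$$

$$= \sum_{d \in D} \sum_{k' \in K} \Big| \sum_{k \in K} \alpha p^k \bar{q}(k,a^k)(k',d) - \beta p'^k \bar{q}(k,a^k)(k',d) \Big|$$

$$\le \sum_{d \in D} \sum_{k' \in K} \sum_{k \in K} \bar{q}(k,a^k)(k',d) \, |\alpha p^k - \beta p'^k| = \|\alpha p - \beta p'\|_1.$$

So $\Psi = (X, A, \hat{q}, r)$ has a general limit value and a general uniform value that we denote by $v^*$. As a consequence, $\Gamma(\pi)$ has a general limit value $v^*(\hat\pi)$.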

Step 4: Given $\varepsilon > 0$, there exist $\alpha > 0$ and a strategy $\sigma$ in the MDP $\Psi(\hat\pi)$ such that the $\theta$-payoff in the MDP is large: $\gamma_\theta(\hat\pi, \sigma) \ge v^*(\hat\pi) - \varepsilon$ whenever $I(\theta) \le \alpha$. Moreover, if we look at the end of the proof of theorem 3.19, we can choose $\sigma$ to be induced by a deterministic play in the Gambling House with state space $Z = \Delta_f(X) \times [0,1]$ associated with $\Psi$. As a consequence one can mimic $\sigma$ to construct a strategy $\sigma^*$ in the original repeated game $\Gamma(\pi)$ such that: $\forall \tau \in \mathcal{T}$, $\gamma_\theta(\pi, \sigma^*, \tau) \ge v^*(\hat\pi) - \varepsilon$ whenever $I(\theta) \le \alpha$.

Step 5: Finally we show that player 2 can also guarantee the value $v^*$ in the repeated game $\Gamma$. Note that in the repeated game he cannot compute the state variable in $\Delta(K)$ without knowing the strategy of player 1. Nevertheless he has no influence on the transition function, so playing independently by large blocks will be sufficient for him in order to guarantee $v^*(\hat\pi)$. We use the following characterization of the value proved in Renault (2012a):

$$v^*(\hat\pi) = \inf_{n \ge 1} \sup_{m \ge 0} v_{m,n}(\pi),$$

where $v_{m,n}$ is the value of the game with payoff function the Cesaro mean of the stage payoffs between stages $m$ and $m+n$. We proceed as in proposition 4.22 of Renault (2012a). Fix $n_0 \ge 1$, then we consider the strategy $\tau^*$ which for each $j \ge 1$ plays optimally in the game whose evaluation is the Cesaro mean of the payoffs on the block of stages $B^j = \{n_0(j-1) + 1, \dots, n_0 j\}$. Since player 2 does not influence the state it is well defined, and this strategy guarantees $\sup_{t \ge 0} v_{t,n_0}(\pi)$ for the overall Cesaro mean.

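Let $\theta \in \Delta(\mathbb{N}^*)$ and $\sigma$ be a strategy of player 1. For each $j \ge 1$, denote by $\underline\theta_j$ the minimum of $\theta$ on the block $B^j = \{(j-1)n_0 + 1, \dots, jn_0\}$. We have

$$\gamma_\theta(\pi, \sigma, \tau^*) = \sum_{j=1}^{+\infty} \mathbb{E}_{\pi,\sigma,\tau^*} \Big( \sum_{t=(j-1)n_0+1}^{jn_0} \theta_t \, g(k_t, i_t, j_t) \Big) \le \sum_{j=1}^{+\infty} n_0 \, \underline\theta_j \, \sup_{t \ge 0} v_{t,n_0}(\pi) + n_0 \sum_{t=1}^{+\infty} |\theta_{t+1} - \theta_t|.$$

Since $n_0 \underline\theta_j \le \sum_{t \in B^j} \theta_t$, the first term is at most $\sup_{t \ge 0} v_{t,n_0}(\pi)$, and the second term equals $n_0 I(\theta)$. Given $\varepsilon$, there exists $n_0$ such that $\sup_{t \ge 0} v_{t,n_0}(\pi) \le v^*(\hat\pi) + \varepsilon$. Fix $\alpha = \varepsilon / n_0$ and $\tau^*$ defined as before; then for all $\theta$ such that $I(\theta) \le \alpha$, we have

$$\sup_{\sigma \in \Sigma} \gamma_\theta(\pi, \sigma, \tau^*) \le v^*(\hat\pi) + 2\varepsilon,$$

and this concludes the proof of theorem 4.9. □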

Example 4.10. The computation of the value in two-player repeated games with incomplete information is a difficult problem, as shown in the next example introduced in Renault (2006) and partially solved by Hörner et al. (2010). The value exists by a theorem in Renault (2006) but it has been computed only for some values of the parameters. The set of states is $K = \{k_1, k_2\}$, the set of actions of player 1 is $I = \{T, B\}$, the set of actions of player 2 is $J = \{L, R\}$, and the payoff is given by

$$k_1: \ \begin{array}{c|cc} & L & R \\ \hline T & 1 & 0 \\ B & 0 & 0 \end{array} \qquad \text{and} \qquad k_2: \ \begin{array}{c|cc} & L & R \\ \hline T & 0 & 0 \\ B & 0 & 1 \end{array}$$

The evolution of the state does not depend on the actions: at each stage the state stays the same with probability $p$ and changes to the other state with probability $1-p$. At each stage, both players observe the past actions played but only player 1 is informed of the current state (with previous notation $C = K \times I \times J$ and $D = I \times J$). For each $p \in [0,1]$, it defines a repeated game $\Gamma^p$. In the case $p = 1$, the matrix is fixed for all the game, thus it is a repeated game with incomplete information on one side à la Aumann Maschler (1995). For all other positive values of $p$, the process is ergodic so the limit value is constant, and it is sufficient
to study the case $p \in [1/2, 1)$ by symmetry of the problem. Hörner et al. (2010) proved that if $p \in [1/2, 2/3)$, then the value is $v_p = \frac{p}{4p-1}$. If $p > 2/3$, we do not know the value except for $p^*$, the solution of $9p^3 - 13p^2 + 6p - 1 = 0$, where one has

$$v_{p^*} = \frac{p^*}{1 - 3p^* + 6(p^*)^2}.$$